Home Credit Default Risk (HCDR)¶

The course project is based on the Home Credit Default Risk (HCDR) Kaggle Competition. The goal of this project is to predict whether or not a client will repay a loan. In order to make sure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

Some of the challenges¶

  1. Dataset size
    • (688 meg compressed) with millions of rows of data
    • 2.71 Gig of data uncompressed
  • Dealing with missing data
  • Imbalanced datasets
  • Summarizing transaction data

Kaggle API setup¶

Kaggle is a Data Science Competition Platform which shares a lot of datasets. In the past, it was troublesome to submit your result as your have to go through the console in your browser and drag your files there. Now you can interact with Kaggle via the command line. E.g.,

! kaggle competitions files home-credit-default-risk

It is quite easy to setup, it takes me less than 15 minutes to finish a submission.

  1. Install library
  • Create a API Token (edit your profile on Kaggle.com); this produces kaggle.json file
  • Put your JSON kaggle.json in the right place
  • Access competition files; make submissions via the command (see examples below)
  • Submit result

For more detailed information on setting the Kaggle API see here and here.

Team and Plan updates¶

Team name: FP_Group22

Team members:

  1. Aarushi Dua (aarudua@iu.edu)
  2. Sai Teja Burla (saburla@iu.edu)
  3. Lakshay Madaan (lmadaan@iu.edu)
  4. Shyam Makwana (smakwana@iu.edu)

Phase leadership plan:

Phase

Phase Leader

Tasks

Phase 1

Shyam Makwana

Project proposal, Layout, Notebook, Credit assignment plan, Gantt chart, Dataset

Phase 2

Aarushi Dua

Grab data, EDA, Metrics, Baseline models, Baseline pipeline, Brief report, Presentation and script

Phase 3

Sai Teja Burla

Feature engineering, Hyperparameter tuning, Additional feature selection, ensemble methods, Presentation and script

Phase 4

Lakshay Madaan

Neural Network, Advanced model architectures, Loss functions, Final project presentation and report



Credit assignment plan:

Phase

Member Name

Task Name

Description














Phase 1

Aarushi Dua

Sai Teja Burla

Shyam Makwana

Lakshay Madaan

Problem Understanding

Comprehending the problem and understanding the provided data 

Shyam Makwana



Assigning Tasks

Assigning tasks to each of the team members 

Phase Planning

Planning all the phases

Data Description

Describing what information each of the given csv files have with a sample of their content 

Sai Teja Burla

Abstract

Brief description of the problem statement at hand

Researched ML models and metric

Description of ML Models and Evaluation Metrics 

Block Diagram

Visual representation of the pipeline

Aarushi Dua

Model Pipeline

Explained the steps to perform during the course of the project

Gantt Chart

Summarize the timeline of the project using gantt charts

Aarushi Dua

Sai Teja Burla

Credit Assignment Plan

Design the four phases and describe the task required.

Lakshay Madaan

Environmental Setup

Connected Data with docker

Data Preparation

Checked Dataset are loading and performing basic data visualization.







Phase 2

Aarushi Dua

Sai Teja Burla

Data Visualization and EDA

Visualize the dataset to gather insights and perform different data pre-processing techniques.

Lakshay Madaan

Shyam Makwana

Feature Engineering and Pipeline Creation

Data Transformation is required and will add/remove necessary features. 

Along with creating a pipeline for numeric and categorical features.

Aarushi Dua

Model Training

Fit the training data on various ML algorithms using pipelines

Sai Teja Burla

Model Evaluation

Apply different evaluation metrics to get the accuracy and loss

Shyam Makwana

Research about different Hyperparameters

Gather hyperparameters of each model applied into the GridsearchCV

Lakshay Madaan

Comparing Baseline Models

Compare the results of each baseline model.





Phase 3

Sai Teja Burla

Feature Importance

Figure out the importance of each features used

Aarushi Dua

Tuning Hyperparameters

Hyperparameter tuning is performed again to improve the performance.

Shyam Makhwana

Visualizing different algorithm’s results

Plot bars and charts to display the results of each model

Lakshay Madaan

Feature Engineering

Any further manipulation in features if required




Phase 4

Lakshay Madaan

Final Model Evaluation

Perform final model evaluation on the best model achieved.

Shyam Makhwana

Report Writing and Video Editing

Write report for final submission and record the video presentation

Sai Teja Burla

Aarushi Dua

Adding Multilayer Perceptron

Introduce advance ML model; i.e Pytorch model for further improvement in model’s accuracy

In [ ]:
!pip install kaggle
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: kaggle in /usr/local/lib/python3.9/dist-packages (1.5.13)
Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.9/dist-packages (from kaggle) (1.16.0)
Requirement already satisfied: certifi in /usr/local/lib/python3.9/dist-packages (from kaggle) (2022.12.7)
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.9/dist-packages (from kaggle) (2.8.2)
Requirement already satisfied: urllib3 in /usr/local/lib/python3.9/dist-packages (from kaggle) (1.26.15)
Requirement already satisfied: python-slugify in /usr/local/lib/python3.9/dist-packages (from kaggle) (8.0.1)
Requirement already satisfied: tqdm in /usr/local/lib/python3.9/dist-packages (from kaggle) (4.65.0)
Requirement already satisfied: requests in /usr/local/lib/python3.9/dist-packages (from kaggle) (2.27.1)
Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.9/dist-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.9/dist-packages (from requests->kaggle) (3.4)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.9/dist-packages (from requests->kaggle) (2.0.12)
In [ ]:
!pwd
/content
In [ ]:
!pwd
/content
In [ ]:
!ls -l kaggle.json
-rw-r--r-- 1 root root 69 Apr 26 01:44 kaggle.json
In [ ]:
!mkdir ~/.kaggle
!cp kaggle.json ~/.kaggle
!chmod 600 ~/.kaggle/kaggle.json
In [ ]:
! kaggle competitions files home-credit-default-risk
name                                 size  creationDate         
----------------------------------  -----  -------------------  
installments_payments.csv           690MB  2019-12-11 02:55:35  
bureau_balance.csv                  358MB  2019-12-11 02:55:35  
bureau.csv                          162MB  2019-12-11 02:55:35  
application_test.csv                 25MB  2019-12-11 02:55:35  
HomeCredit_columns_description.csv   37KB  2019-12-11 02:55:35  
POS_CASH_balance.csv                375MB  2019-12-11 02:55:35  
application_train.csv               158MB  2019-12-11 02:55:35  
credit_card_balance.csv             405MB  2019-12-11 02:55:35  
sample_submission.csv               524KB  2019-12-11 02:55:35  
previous_application.csv            386MB  2019-12-11 02:55:35  

Dataset and how to download¶

Back ground Home Credit Group¶

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit Group¶

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

Background on the dataset¶

Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.

The company operates in 14 countries (including United States, Russia, Kazahstan, Belarus, China, India) and focuses on lending primarily to people with little or no credit history which will either not obtain loans or became victims of untrustworthly lenders.

Home Credit group has over 29 million customers, total assests of 21 billions Euro, over 160 millions loans, with the majority in Asia and and almost half of them in China (as of 19-05-2018).

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

Data files overview¶

The HomeCredit_columns_description.csv acts as a data dictioanry.

There are 7 different sources of data:

  • application_train/application_test (307k rows, and 48k rows): the main training and testing data with information about each loan application at Home Credit. Every loan has its own row and is identified by the feature SK_ID_CURR. The training application data comes with the TARGET indicating 0: the loan was repaid or 1: the loan was not repaid. The target variable defines if the client had payment difficulties meaning he/she had late payment more than X days on at least one of the first Y installments of the loan. Such case is marked as 1 while other all other cases as 0.
  • bureau (1.7 Million rows): data concerning client's previous credits from other financial institutions. Each previous credit has its own row in bureau, but one loan in the application data can have multiple previous credits.
  • bureau_balance (27 Million rows): monthly data about the previous credits in bureau. Each row is one month of a previous credit, and a single previous credit can have multiple rows, one for each month of the credit length.
  • previous_application (1.6 Million rows): previous applications for loans at Home Credit of clients who have loans in the application data. Each current loan in the application data can have multiple previous loans. Each previous application has one row and is identified by the feature SK_ID_PREV.
  • POS_CASH_BALANCE (10 Million rows): monthly data about previous point of sale or cash loans clients have had with Home Credit. Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows.
  • credit_card_balance: monthly data about previous credit cards clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.
  • installments_payment (13.6 Million rows): payment history for previous loans at Home Credit. There is one row for every made payment and one row for every missed payment.

Table sizes¶

name                       [  rows cols]     MegaBytes         
-----------------------  ------------------  -------
application_train       : [  307,511, 122]:   158MB
application_test        : [   48,744, 121]:   25MB
bureau                  : [ 1,716,428, 17]    162MB
bureau_balance          : [ 27,299,925, 3]:   358MB
credit_card_balance     : [  3,840,312, 23]   405MB
installments_payments   : [ 13,605,401, 8]    690MB
previous_application    : [  1,670,214, 37]   386MB
POS_CASH_balance        : [ 10,001,358, 8]    375MB

dataset

Downloading the files via Kaggle API¶

Create a base directory:

DATA_DIR = "../../../Data/home-credit-default-risk"   #same level as course repo in the data directory

Please download the project data files and data dictionary and unzip them using either of the following approaches:

  1. Click on the Download button on the following Data Webpage and unzip the zip file to the BASE_DIR
  2. If you plan to use the Kaggle API, please use the following steps.
In [ ]:
DATA_DIR = "../data/home-credit-default-risk"   #same level as course repo in the data directory
#DATA_DIR = os.path.join('./ddddd/')
!mkdir ../data ../data/home-credit-default-risk
In [ ]:
!ls -l {DATA_DIR}
total 0
In [ ]:
! kaggle competitions download home-credit-default-risk -p $DATA_DIR
Downloading home-credit-default-risk.zip to ../data/home-credit-default-risk
100% 688M/688M [00:23<00:00, 34.1MB/s]
100% 688M/688M [00:23<00:00, 31.1MB/s]
In [ ]:
!pwd 
/content
In [ ]:
!ls -l $DATA_DIR
total 704708
-rw-r--r-- 1 root root 721616255 Apr 26 01:45 home-credit-default-risk.zip
In [ ]:
#!rm -r  DATA_DIR

Imports¶

In [ ]:
import numpy as np
import pandas as pd 
from sklearn.preprocessing import LabelEncoder
import os
import zipfile
from sklearn.base import BaseEstimator, TransformerMixin
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
import warnings
warnings.filterwarnings('ignore')
In [ ]:
unzippingReq = True #True
if unzippingReq: #please modify this code 
    zip_ref = zipfile.ZipFile(f'{DATA_DIR}/home-credit-default-risk.zip', 'r')
    # extractall():  Extract all members from the archive to the current working directory. path specifies a different directory to extract to
    zip_ref.extractall('../data/home-credit-default-risk') 
    zip_ref.close()

Data files overview¶

Data Dictionary¶

As part of the data download comes a Data Dictionary. It named HomeCredit_columns_description.csv

image.png

Application train¶

In [ ]:
ls -l ../data/home-credit-default-risk/application_train.csv/
ls: cannot access '../data/home-credit-default-risk/application_train.csv/': Not a directory
In [ ]:
import numpy as np
import pandas as pd 
from sklearn.preprocessing import LabelEncoder
import os
import zipfile
from sklearn.base import BaseEstimator, TransformerMixin
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
import warnings
warnings.filterwarnings('ignore')

def load_data(in_path, name):
    df = pd.read_csv(in_path)
    print(f"{name}: shape is {df.shape}")
    print(df.info())
    display(df.head(5))
    return df

datasets={}  # lets store the datasets in a dictionary so we can keep track of them easily
ds_name = 'application_train'
DATA_DIR=f"../data/home-credit-default-risk/"
pd.set_option('display.max_columns', None)
datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)

datasets['application_train'].shape
application_train: shape is (307511, 122)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
None
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 351000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.018801 -9461 -637 -3648.0 -2120 NaN 1 1 0 1 1 0 Laborers 1.0 2 2 WEDNESDAY 10 0 0 0 0 0 0 Business Entity Type 3 0.083037 0.262949 0.139376 0.0247 0.0369 0.9722 0.6192 0.0143 0.00 0.0690 0.0833 0.1250 0.0369 0.0202 0.0190 0.0000 0.0000 0.0252 0.0383 0.9722 0.6341 0.0144 0.0000 0.0690 0.0833 0.1250 0.0377 0.022 0.0198 0.0 0.0 0.0250 0.0369 0.9722 0.6243 0.0144 0.00 0.0690 0.0833 0.1250 0.0375 0.0205 0.0193 0.0000 0.00 reg oper account block of flats 0.0149 Stone, brick No 2.0 2.0 2.0 2.0 -1134.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 1129500.0 Family State servant Higher education Married House / apartment 0.003541 -16765 -1188 -1186.0 -291 NaN 1 1 0 1 1 0 Core staff 2.0 1 1 MONDAY 11 0 0 0 0 0 0 School 0.311267 0.622246 NaN 0.0959 0.0529 0.9851 0.7960 0.0605 0.08 0.0345 0.2917 0.3333 0.0130 0.0773 0.0549 0.0039 0.0098 0.0924 0.0538 0.9851 0.8040 0.0497 0.0806 0.0345 0.2917 0.3333 0.0128 0.079 0.0554 0.0 0.0 0.0968 0.0529 0.9851 0.7987 0.0608 0.08 0.0345 0.2917 0.3333 0.0132 0.0787 0.0558 0.0039 0.01 reg oper account block of flats 0.0714 Block No 1.0 0.0 1.0 0.0 -828.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 135000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.010032 -19046 -225 -4260.0 -2531 26.0 1 1 1 1 1 0 Laborers 1.0 2 2 MONDAY 9 0 0 0 0 0 0 Government NaN 0.555912 0.729567 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -815.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 297000.0 Unaccompanied Working Secondary / secondary special Civil marriage House / apartment 0.008019 -19005 -3039 -9833.0 -2437 NaN 1 1 0 1 0 0 Laborers 2.0 2 2 WEDNESDAY 17 0 0 0 0 0 0 Business Entity Type 3 NaN 0.650442 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0 0.0 2.0 0.0 -617.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 513000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.028663 -19932 -3038 -4311.0 -3458 NaN 1 1 0 1 0 0 Core staff 1.0 2 2 THURSDAY 11 0 0 0 0 1 1 Religion NaN 0.322738 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -1106.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
Out[ ]:
(307511, 122)

The training data has 307511 observations with each row representing one loan detail including Target feature (0: Loan repaid and 1: Loan not repaid) along with other 121 features.

In [ ]:
#COLUMNS PRESENT IN THE DATA
print(datasets['application_train'].columns)
Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
       'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY',
       ...
       'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20',
       'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR',
       'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
       'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
       'AMT_REQ_CREDIT_BUREAU_YEAR'],
      dtype='object', length=122)
In [ ]:
DATA_DIR
Out[ ]:
'../data/home-credit-default-risk/'

Application test¶

  • application_train/application_test: the main training and testing data with information about each loan application at Home Credit. Every loan has its own row and is identified by the feature SK_ID_CURR. The training application data comes with the TARGET indicating 0: the loan was repaid or 1: the loan was not repaid. The target variable defines if the client had payment difficulties meaning he/she had late payment more than X days on at least one of the first Y installments of the loan. Such case is marked as 1 while other all other cases as 0.
In [ ]:
ds_name = 'application_test'
datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
application_test: shape is (48744, 121)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(40), object(16)
memory usage: 45.0+ MB
None
SK_ID_CURR NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100001 Cash loans F N Y 0 135000.0 568800.0 20560.5 450000.0 Unaccompanied Working Higher education Married House / apartment 0.018850 -19241 -2329 -5170.0 -812 NaN 1 1 0 1 0 1 NaN 2.0 2 2 TUESDAY 18 0 0 0 0 0 0 Kindergarten 0.752614 0.789654 0.159520 0.0660 0.0590 0.9732 NaN NaN NaN 0.1379 0.125 NaN NaN NaN 0.0505 NaN NaN 0.0672 0.0612 0.9732 NaN NaN NaN 0.1379 0.125 NaN NaN NaN 0.0526 NaN NaN 0.0666 0.0590 0.9732 NaN NaN NaN 0.1379 0.125 NaN NaN NaN 0.0514 NaN NaN NaN block of flats 0.0392 Stone, brick No 0.0 0.0 0.0 0.0 -1740.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
1 100005 Cash loans M N Y 0 99000.0 222768.0 17370.0 180000.0 Unaccompanied Working Secondary / secondary special Married House / apartment 0.035792 -18064 -4469 -9118.0 -1623 NaN 1 1 0 1 0 0 Low-skill Laborers 2.0 2 2 FRIDAY 9 0 0 0 0 0 0 Self-employed 0.564990 0.291656 0.432962 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 0.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 3.0
2 100013 Cash loans M Y Y 0 202500.0 663264.0 69777.0 630000.0 NaN Working Higher education Married House / apartment 0.019101 -20038 -4458 -2175.0 -3503 5.0 1 1 0 1 0 0 Drivers 2.0 2 2 MONDAY 14 0 0 0 0 0 0 Transport: type 3 NaN 0.699787 0.610991 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -856.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 1.0 4.0
3 100028 Cash loans F N Y 2 315000.0 1575000.0 49018.5 1575000.0 Unaccompanied Working Secondary / secondary special Married House / apartment 0.026392 -13976 -1866 -2000.0 -4208 NaN 1 1 0 1 1 0 Sales staff 4.0 2 2 WEDNESDAY 11 0 0 0 0 0 0 Business Entity Type 3 0.525734 0.509677 0.612704 0.3052 0.1974 0.9970 0.9592 0.1165 0.32 0.2759 0.375 0.0417 0.2042 0.2404 0.3673 0.0386 0.08 0.3109 0.2049 0.9970 0.9608 0.1176 0.3222 0.2759 0.375 0.0417 0.2089 0.2626 0.3827 0.0389 0.0847 0.3081 0.1974 0.9970 0.9597 0.1173 0.32 0.2759 0.375 0.0417 0.2078 0.2446 0.3739 0.0388 0.0817 reg oper account block of flats 0.3700 Panel No 0.0 0.0 0.0 0.0 -1805.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 3.0
4 100038 Cash loans M Y N 1 180000.0 625500.0 32067.0 625500.0 Unaccompanied Working Secondary / secondary special Married House / apartment 0.010032 -13040 -2191 -4000.0 -4262 16.0 1 1 1 1 0 0 NaN 3.0 2 2 FRIDAY 5 0 0 0 0 1 1 Business Entity Type 3 0.202145 0.425687 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -821.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN

The application dataset has the most information about the client: Gender, income, family status, education ...

The Other datasets¶

  • bureau: data concerning client's previous credits from other financial institutions. Each previous credit has its own row in bureau, but one loan in the application data can have multiple previous credits.
  • bureau_balance: monthly data about the previous credits in bureau. Each row is one month of a previous credit, and a single previous credit can have multiple rows, one for each month of the credit length.
  • previous_application: previous applications for loans at Home Credit of clients who have loans in the application data. Each current loan in the application data can have multiple previous loans. Each previous application has one row and is identified by the feature SK_ID_PREV.
  • POS_CASH_BALANCE: monthly data about previous point of sale or cash loans clients have had with Home Credit. Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows.
  • credit_card_balance: monthly data about previous credit cards clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.
  • installments_payment: payment history for previous loans at Home Credit. There is one row for every made payment and one row for every missed payment.
In [ ]:
%%time
ds_names = ("application_train", "application_test", "bureau","bureau_balance","credit_card_balance","installments_payments",
            "previous_application","POS_CASH_balance")

for ds_name in ds_names:
    datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
application_train: shape is (307511, 122)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
None
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 351000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.018801 -9461 -637 -3648.0 -2120 NaN 1 1 0 1 1 0 Laborers 1.0 2 2 WEDNESDAY 10 0 0 0 0 0 0 Business Entity Type 3 0.083037 0.262949 0.139376 0.0247 0.0369 0.9722 0.6192 0.0143 0.00 0.0690 0.0833 0.1250 0.0369 0.0202 0.0190 0.0000 0.0000 0.0252 0.0383 0.9722 0.6341 0.0144 0.0000 0.0690 0.0833 0.1250 0.0377 0.022 0.0198 0.0 0.0 0.0250 0.0369 0.9722 0.6243 0.0144 0.00 0.0690 0.0833 0.1250 0.0375 0.0205 0.0193 0.0000 0.00 reg oper account block of flats 0.0149 Stone, brick No 2.0 2.0 2.0 2.0 -1134.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 1129500.0 Family State servant Higher education Married House / apartment 0.003541 -16765 -1188 -1186.0 -291 NaN 1 1 0 1 1 0 Core staff 2.0 1 1 MONDAY 11 0 0 0 0 0 0 School 0.311267 0.622246 NaN 0.0959 0.0529 0.9851 0.7960 0.0605 0.08 0.0345 0.2917 0.3333 0.0130 0.0773 0.0549 0.0039 0.0098 0.0924 0.0538 0.9851 0.8040 0.0497 0.0806 0.0345 0.2917 0.3333 0.0128 0.079 0.0554 0.0 0.0 0.0968 0.0529 0.9851 0.7987 0.0608 0.08 0.0345 0.2917 0.3333 0.0132 0.0787 0.0558 0.0039 0.01 reg oper account block of flats 0.0714 Block No 1.0 0.0 1.0 0.0 -828.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 135000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.010032 -19046 -225 -4260.0 -2531 26.0 1 1 1 1 1 0 Laborers 1.0 2 2 MONDAY 9 0 0 0 0 0 0 Government NaN 0.555912 0.729567 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -815.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 297000.0 Unaccompanied Working Secondary / secondary special Civil marriage House / apartment 0.008019 -19005 -3039 -9833.0 -2437 NaN 1 1 0 1 0 0 Laborers 2.0 2 2 WEDNESDAY 17 0 0 0 0 0 0 Business Entity Type 3 NaN 0.650442 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0 0.0 2.0 0.0 -617.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 513000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.028663 -19932 -3038 -4311.0 -3458 NaN 1 1 0 1 0 0 Core staff 1.0 2 2 THURSDAY 11 0 0 0 0 1 1 Religion NaN 0.322738 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -1106.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
application_test: shape is (48744, 121)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(40), object(16)
memory usage: 45.0+ MB
None
SK_ID_CURR NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100001 Cash loans F N Y 0 135000.0 568800.0 20560.5 450000.0 Unaccompanied Working Higher education Married House / apartment 0.018850 -19241 -2329 -5170.0 -812 NaN 1 1 0 1 0 1 NaN 2.0 2 2 TUESDAY 18 0 0 0 0 0 0 Kindergarten 0.752614 0.789654 0.159520 0.0660 0.0590 0.9732 NaN NaN NaN 0.1379 0.125 NaN NaN NaN 0.0505 NaN NaN 0.0672 0.0612 0.9732 NaN NaN NaN 0.1379 0.125 NaN NaN NaN 0.0526 NaN NaN 0.0666 0.0590 0.9732 NaN NaN NaN 0.1379 0.125 NaN NaN NaN 0.0514 NaN NaN NaN block of flats 0.0392 Stone, brick No 0.0 0.0 0.0 0.0 -1740.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
1 100005 Cash loans M N Y 0 99000.0 222768.0 17370.0 180000.0 Unaccompanied Working Secondary / secondary special Married House / apartment 0.035792 -18064 -4469 -9118.0 -1623 NaN 1 1 0 1 0 0 Low-skill Laborers 2.0 2 2 FRIDAY 9 0 0 0 0 0 0 Self-employed 0.564990 0.291656 0.432962 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 0.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 3.0
2 100013 Cash loans M Y Y 0 202500.0 663264.0 69777.0 630000.0 NaN Working Higher education Married House / apartment 0.019101 -20038 -4458 -2175.0 -3503 5.0 1 1 0 1 0 0 Drivers 2.0 2 2 MONDAY 14 0 0 0 0 0 0 Transport: type 3 NaN 0.699787 0.610991 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -856.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 1.0 4.0
3 100028 Cash loans F N Y 2 315000.0 1575000.0 49018.5 1575000.0 Unaccompanied Working Secondary / secondary special Married House / apartment 0.026392 -13976 -1866 -2000.0 -4208 NaN 1 1 0 1 1 0 Sales staff 4.0 2 2 WEDNESDAY 11 0 0 0 0 0 0 Business Entity Type 3 0.525734 0.509677 0.612704 0.3052 0.1974 0.9970 0.9592 0.1165 0.32 0.2759 0.375 0.0417 0.2042 0.2404 0.3673 0.0386 0.08 0.3109 0.2049 0.9970 0.9608 0.1176 0.3222 0.2759 0.375 0.0417 0.2089 0.2626 0.3827 0.0389 0.0847 0.3081 0.1974 0.9970 0.9597 0.1173 0.32 0.2759 0.375 0.0417 0.2078 0.2446 0.3739 0.0388 0.0817 reg oper account block of flats 0.3700 Panel No 0.0 0.0 0.0 0.0 -1805.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 3.0
4 100038 Cash loans M Y N 1 180000.0 625500.0 32067.0 625500.0 Unaccompanied Working Secondary / secondary special Married House / apartment 0.010032 -13040 -2191 -4000.0 -4262 16.0 1 1 1 1 0 0 NaN 3.0 2 2 FRIDAY 5 0 0 0 0 1 1 Business Entity Type 3 0.202145 0.425687 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -821.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN
bureau: shape is (1716428, 17)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1716428 entries, 0 to 1716427
Data columns (total 17 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   SK_ID_CURR              int64  
 1   SK_ID_BUREAU            int64  
 2   CREDIT_ACTIVE           object 
 3   CREDIT_CURRENCY         object 
 4   DAYS_CREDIT             int64  
 5   CREDIT_DAY_OVERDUE      int64  
 6   DAYS_CREDIT_ENDDATE     float64
 7   DAYS_ENDDATE_FACT       float64
 8   AMT_CREDIT_MAX_OVERDUE  float64
 9   CNT_CREDIT_PROLONG      int64  
 10  AMT_CREDIT_SUM          float64
 11  AMT_CREDIT_SUM_DEBT     float64
 12  AMT_CREDIT_SUM_LIMIT    float64
 13  AMT_CREDIT_SUM_OVERDUE  float64
 14  CREDIT_TYPE             object 
 15  DAYS_CREDIT_UPDATE      int64  
 16  AMT_ANNUITY             float64
dtypes: float64(8), int64(6), object(3)
memory usage: 222.6+ MB
None
SK_ID_CURR SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE CREDIT_TYPE DAYS_CREDIT_UPDATE AMT_ANNUITY
0 215354 5714462 Closed currency 1 -497 0 -153.0 -153.0 NaN 0 91323.0 0.0 NaN 0.0 Consumer credit -131 NaN
1 215354 5714463 Active currency 1 -208 0 1075.0 NaN NaN 0 225000.0 171342.0 NaN 0.0 Credit card -20 NaN
2 215354 5714464 Active currency 1 -203 0 528.0 NaN NaN 0 464323.5 NaN NaN 0.0 Consumer credit -16 NaN
3 215354 5714465 Active currency 1 -203 0 NaN NaN NaN 0 90000.0 NaN NaN 0.0 Credit card -16 NaN
4 215354 5714466 Active currency 1 -629 0 1197.0 NaN 77674.5 0 2700000.0 NaN NaN 0.0 Consumer credit -21 NaN
bureau_balance: shape is (27299925, 3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27299925 entries, 0 to 27299924
Data columns (total 3 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   SK_ID_BUREAU    int64 
 1   MONTHS_BALANCE  int64 
 2   STATUS          object
dtypes: int64(2), object(1)
memory usage: 624.8+ MB
None
SK_ID_BUREAU MONTHS_BALANCE STATUS
0 5715448 0 C
1 5715448 -1 C
2 5715448 -2 C
3 5715448 -3 C
4 5715448 -4 C
credit_card_balance: shape is (3840312, 23)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3840312 entries, 0 to 3840311
Data columns (total 23 columns):
 #   Column                      Dtype  
---  ------                      -----  
 0   SK_ID_PREV                  int64  
 1   SK_ID_CURR                  int64  
 2   MONTHS_BALANCE              int64  
 3   AMT_BALANCE                 float64
 4   AMT_CREDIT_LIMIT_ACTUAL     int64  
 5   AMT_DRAWINGS_ATM_CURRENT    float64
 6   AMT_DRAWINGS_CURRENT        float64
 7   AMT_DRAWINGS_OTHER_CURRENT  float64
 8   AMT_DRAWINGS_POS_CURRENT    float64
 9   AMT_INST_MIN_REGULARITY     float64
 10  AMT_PAYMENT_CURRENT         float64
 11  AMT_PAYMENT_TOTAL_CURRENT   float64
 12  AMT_RECEIVABLE_PRINCIPAL    float64
 13  AMT_RECIVABLE               float64
 14  AMT_TOTAL_RECEIVABLE        float64
 15  CNT_DRAWINGS_ATM_CURRENT    float64
 16  CNT_DRAWINGS_CURRENT        int64  
 17  CNT_DRAWINGS_OTHER_CURRENT  float64
 18  CNT_DRAWINGS_POS_CURRENT    float64
 19  CNT_INSTALMENT_MATURE_CUM   float64
 20  NAME_CONTRACT_STATUS        object 
 21  SK_DPD                      int64  
 22  SK_DPD_DEF                  int64  
dtypes: float64(15), int64(7), object(1)
memory usage: 673.9+ MB
None
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE AMT_BALANCE AMT_CREDIT_LIMIT_ACTUAL AMT_DRAWINGS_ATM_CURRENT AMT_DRAWINGS_CURRENT AMT_DRAWINGS_OTHER_CURRENT AMT_DRAWINGS_POS_CURRENT AMT_INST_MIN_REGULARITY AMT_PAYMENT_CURRENT AMT_PAYMENT_TOTAL_CURRENT AMT_RECEIVABLE_PRINCIPAL AMT_RECIVABLE AMT_TOTAL_RECEIVABLE CNT_DRAWINGS_ATM_CURRENT CNT_DRAWINGS_CURRENT CNT_DRAWINGS_OTHER_CURRENT CNT_DRAWINGS_POS_CURRENT CNT_INSTALMENT_MATURE_CUM NAME_CONTRACT_STATUS SK_DPD SK_DPD_DEF
0 2562384 378907 -6 56.970 135000 0.0 877.5 0.0 877.5 1700.325 1800.0 1800.0 0.000 0.000 0.000 0.0 1 0.0 1.0 35.0 Active 0 0
1 2582071 363914 -1 63975.555 45000 2250.0 2250.0 0.0 0.0 2250.000 2250.0 2250.0 60175.080 64875.555 64875.555 1.0 1 0.0 0.0 69.0 Active 0 0
2 1740877 371185 -7 31815.225 450000 0.0 0.0 0.0 0.0 2250.000 2250.0 2250.0 26926.425 31460.085 31460.085 0.0 0 0.0 0.0 30.0 Active 0 0
3 1389973 337855 -4 236572.110 225000 2250.0 2250.0 0.0 0.0 11795.760 11925.0 11925.0 224949.285 233048.970 233048.970 1.0 1 0.0 0.0 10.0 Active 0 0
4 1891521 126868 -1 453919.455 450000 0.0 11547.0 0.0 11547.0 22924.890 27000.0 27000.0 443044.395 453919.455 453919.455 0.0 1 0.0 1.0 101.0 Active 0 0
installments_payments: shape is (13605401, 8)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13605401 entries, 0 to 13605400
Data columns (total 8 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   SK_ID_PREV              int64  
 1   SK_ID_CURR              int64  
 2   NUM_INSTALMENT_VERSION  float64
 3   NUM_INSTALMENT_NUMBER   int64  
 4   DAYS_INSTALMENT         float64
 5   DAYS_ENTRY_PAYMENT      float64
 6   AMT_INSTALMENT          float64
 7   AMT_PAYMENT             float64
dtypes: float64(5), int64(3)
memory usage: 830.4 MB
None
SK_ID_PREV SK_ID_CURR NUM_INSTALMENT_VERSION NUM_INSTALMENT_NUMBER DAYS_INSTALMENT DAYS_ENTRY_PAYMENT AMT_INSTALMENT AMT_PAYMENT
0 1054186 161674 1.0 6 -1180.0 -1187.0 6948.360 6948.360
1 1330831 151639 0.0 34 -2156.0 -2156.0 1716.525 1716.525
2 2085231 193053 2.0 1 -63.0 -63.0 25425.000 25425.000
3 2452527 199697 1.0 3 -2418.0 -2426.0 24350.130 24350.130
4 2714724 167756 1.0 2 -1383.0 -1366.0 2165.040 2160.585
previous_application: shape is (1670214, 37)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1670214 entries, 0 to 1670213
Data columns (total 37 columns):
 #   Column                       Non-Null Count    Dtype  
---  ------                       --------------    -----  
 0   SK_ID_PREV                   1670214 non-null  int64  
 1   SK_ID_CURR                   1670214 non-null  int64  
 2   NAME_CONTRACT_TYPE           1670214 non-null  object 
 3   AMT_ANNUITY                  1297979 non-null  float64
 4   AMT_APPLICATION              1670214 non-null  float64
 5   AMT_CREDIT                   1670213 non-null  float64
 6   AMT_DOWN_PAYMENT             774370 non-null   float64
 7   AMT_GOODS_PRICE              1284699 non-null  float64
 8   WEEKDAY_APPR_PROCESS_START   1670214 non-null  object 
 9   HOUR_APPR_PROCESS_START      1670214 non-null  int64  
 10  FLAG_LAST_APPL_PER_CONTRACT  1670214 non-null  object 
 11  NFLAG_LAST_APPL_IN_DAY       1670214 non-null  int64  
 12  RATE_DOWN_PAYMENT            774370 non-null   float64
 13  RATE_INTEREST_PRIMARY        5951 non-null     float64
 14  RATE_INTEREST_PRIVILEGED     5951 non-null     float64
 15  NAME_CASH_LOAN_PURPOSE       1670214 non-null  object 
 16  NAME_CONTRACT_STATUS         1670214 non-null  object 
 17  DAYS_DECISION                1670214 non-null  int64  
 18  NAME_PAYMENT_TYPE            1670214 non-null  object 
 19  CODE_REJECT_REASON           1670214 non-null  object 
 20  NAME_TYPE_SUITE              849809 non-null   object 
 21  NAME_CLIENT_TYPE             1670214 non-null  object 
 22  NAME_GOODS_CATEGORY          1670214 non-null  object 
 23  NAME_PORTFOLIO               1670214 non-null  object 
 24  NAME_PRODUCT_TYPE            1670214 non-null  object 
 25  CHANNEL_TYPE                 1670214 non-null  object 
 26  SELLERPLACE_AREA             1670214 non-null  int64  
 27  NAME_SELLER_INDUSTRY         1670214 non-null  object 
 28  CNT_PAYMENT                  1297984 non-null  float64
 29  NAME_YIELD_GROUP             1670214 non-null  object 
 30  PRODUCT_COMBINATION          1669868 non-null  object 
 31  DAYS_FIRST_DRAWING           997149 non-null   float64
 32  DAYS_FIRST_DUE               997149 non-null   float64
 33  DAYS_LAST_DUE_1ST_VERSION    997149 non-null   float64
 34  DAYS_LAST_DUE                997149 non-null   float64
 35  DAYS_TERMINATION             997149 non-null   float64
 36  NFLAG_INSURED_ON_APPROVAL    997149 non-null   float64
dtypes: float64(15), int64(6), object(16)
memory usage: 471.5+ MB
None
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START FLAG_LAST_APPL_PER_CONTRACT NFLAG_LAST_APPL_IN_DAY RATE_DOWN_PAYMENT RATE_INTEREST_PRIMARY RATE_INTEREST_PRIVILEGED NAME_CASH_LOAN_PURPOSE NAME_CONTRACT_STATUS DAYS_DECISION NAME_PAYMENT_TYPE CODE_REJECT_REASON NAME_TYPE_SUITE NAME_CLIENT_TYPE NAME_GOODS_CATEGORY NAME_PORTFOLIO NAME_PRODUCT_TYPE CHANNEL_TYPE SELLERPLACE_AREA NAME_SELLER_INDUSTRY CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION DAYS_FIRST_DRAWING DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION NFLAG_INSURED_ON_APPROVAL
0 2030495 271877 Consumer loans 1730.430 17145.0 17145.0 0.0 17145.0 SATURDAY 15 Y 1 0.0 0.182832 0.867336 XAP Approved -73 Cash through the bank XAP NaN Repeater Mobile POS XNA Country-wide 35 Connectivity 12.0 middle POS mobile with interest 365243.0 -42.0 300.0 -42.0 -37.0 0.0
1 2802425 108129 Cash loans 25188.615 607500.0 679671.0 NaN 607500.0 THURSDAY 11 Y 1 NaN NaN NaN XNA Approved -164 XNA XAP Unaccompanied Repeater XNA Cash x-sell Contact center -1 XNA 36.0 low_action Cash X-Sell: low 365243.0 -134.0 916.0 365243.0 365243.0 1.0
2 2523466 122040 Cash loans 15060.735 112500.0 136444.5 NaN 112500.0 TUESDAY 11 Y 1 NaN NaN NaN XNA Approved -301 Cash through the bank XAP Spouse, partner Repeater XNA Cash x-sell Credit and cash offices -1 XNA 12.0 high Cash X-Sell: high 365243.0 -271.0 59.0 365243.0 365243.0 1.0
3 2819243 176158 Cash loans 47041.335 450000.0 470790.0 NaN 450000.0 MONDAY 7 Y 1 NaN NaN NaN XNA Approved -512 Cash through the bank XAP NaN Repeater XNA Cash x-sell Credit and cash offices -1 XNA 12.0 middle Cash X-Sell: middle 365243.0 -482.0 -152.0 -182.0 -177.0 1.0
4 1784265 202054 Cash loans 31924.395 337500.0 404055.0 NaN 337500.0 THURSDAY 9 Y 1 NaN NaN NaN Repairs Refused -781 Cash through the bank HC NaN Repeater XNA Cash walk-in Credit and cash offices -1 XNA 24.0 high Cash Street: high NaN NaN NaN NaN NaN NaN
POS_CASH_balance: shape is (10001358, 8)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001358 entries, 0 to 10001357
Data columns (total 8 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   SK_ID_PREV             int64  
 1   SK_ID_CURR             int64  
 2   MONTHS_BALANCE         int64  
 3   CNT_INSTALMENT         float64
 4   CNT_INSTALMENT_FUTURE  float64
 5   NAME_CONTRACT_STATUS   object 
 6   SK_DPD                 int64  
 7   SK_DPD_DEF             int64  
dtypes: float64(2), int64(5), object(1)
memory usage: 610.4+ MB
None
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE CNT_INSTALMENT CNT_INSTALMENT_FUTURE NAME_CONTRACT_STATUS SK_DPD SK_DPD_DEF
0 1803195 182943 -31 48.0 45.0 Active 0 0
1 1715348 367990 -33 36.0 35.0 Active 0 0
2 1784872 397406 -32 12.0 9.0 Active 0 0
3 1903291 269225 -35 48.0 42.0 Active 0 0
4 2341044 334279 -35 36.0 35.0 Active 0 0
CPU times: user 35 s, sys: 7.56 s, total: 42.5 s
Wall time: 43.4 s

Exploratory Data Analysis¶

Dataset sizes¶

In [ ]:
for ds_name in datasets.keys():
    print(f'dataset {ds_name:24}: [ {datasets[ds_name].shape[0]:10,}, {datasets[ds_name].shape[1]}]')
dataset application_train       : [    307,511, 122]
dataset application_test        : [     48,744, 121]
dataset bureau                  : [  1,716,428, 17]
dataset bureau_balance          : [ 27,299,925, 3]
dataset credit_card_balance     : [  3,840,312, 23]
dataset installments_payments   : [ 13,605,401, 8]
dataset previous_application    : [  1,670,214, 37]
dataset POS_CASH_balance        : [ 10,001,358, 8]

This process will summarize the data using statistical and visualization approaches with the objective to focus on key features of the data so that data can be cleaned for training

Summary of Application train¶

In [ ]:
# COLUMNS INFO
datasets["application_train"].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB

Observation:¶

  • There are 65 columns with float datatype, 41 columns with int and 16 columns with categorical value.
In [ ]:
datasets["application_train"].describe() #numerical only features
Out[ ]:
SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 307511.000000 307511.000000 307511.000000 3.075110e+05 3.075110e+05 307499.000000 3.072330e+05 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 104582.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307509.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 134133.000000 3.068510e+05 246546.000000 151450.00000 127568.000000 157504.000000 103023.000000 92646.000000 143620.000000 152683.000000 154491.000000 98869.000000 124921.000000 97312.000000 153161.000000 93997.000000 137829.000000 151450.000000 127568.000000 157504.000000 103023.000000 92646.000000 143620.000000 152683.000000 154491.000000 98869.000000 124921.000000 97312.000000 153161.000000 93997.000000 137829.000000 151450.000000 127568.000000 157504.000000 103023.000000 92646.000000 143620.000000 152683.000000 154491.000000 98869.000000 124921.000000 97312.000000 153161.000000 93997.000000 137829.000000 159080.000000 306490.000000 306490.000000 306490.000000 306490.000000 307510.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.00000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 265992.000000 265992.000000 265992.000000 265992.000000 265992.000000 265992.000000
mean 278180.518577 0.080729 0.417052 1.687979e+05 5.990260e+05 27108.573909 5.383962e+05 0.020868 -16036.995067 63815.045904 -4986.120328 -2994.202373 12.061091 0.999997 0.819889 0.199368 0.998133 0.281066 0.056720 2.152665 2.052463 2.031521 12.063419 0.015144 0.050769 0.040659 0.078173 0.230454 0.179555 0.502130 5.143927e-01 0.510853 0.11744 0.088442 0.977735 0.752471 0.044621 0.078942 0.149725 0.226282 0.231894 0.066333 0.100775 0.107399 0.008809 0.028358 0.114231 0.087543 0.977065 0.759637 0.042553 0.074490 0.145193 0.222315 0.228058 0.064958 0.105645 0.105975 0.008076 0.027022 0.117850 0.087955 0.977752 0.755746 0.044595 0.078078 0.149213 0.225897 0.231625 0.067169 0.101954 0.108607 0.008651 0.028236 0.102547 1.422245 0.143421 1.405292 0.100049 -962.858788 0.000042 0.710023 0.000081 0.015115 0.088055 0.000192 0.081376 0.003896 0.000023 0.003912 0.000007 0.003525 0.002936 0.00121 0.009928 0.000267 0.008130 0.000595 0.000507 0.000335 0.006402 0.007000 0.034362 0.267395 0.265474 1.899974
std 102790.175348 0.272419 0.722121 2.371231e+05 4.024908e+05 14493.737315 3.694465e+05 0.013831 4363.988632 141275.766519 3522.886321 1509.450419 11.944812 0.001803 0.384280 0.399526 0.043164 0.449521 0.231307 0.910682 0.509034 0.502737 3.265832 0.122126 0.219526 0.197499 0.268444 0.421124 0.383817 0.211062 1.910602e-01 0.194844 0.10824 0.082438 0.059223 0.113280 0.076036 0.134576 0.100049 0.144641 0.161380 0.081184 0.092576 0.110565 0.047732 0.069523 0.107936 0.084307 0.064575 0.110111 0.074445 0.132256 0.100977 0.143709 0.161160 0.081750 0.097880 0.111845 0.046276 0.070254 0.109076 0.082179 0.059897 0.112066 0.076144 0.134467 0.100368 0.145067 0.161934 0.082167 0.093642 0.112260 0.047415 0.070166 0.107462 2.400989 0.446698 2.379803 0.362291 826.808487 0.006502 0.453752 0.009016 0.122010 0.283376 0.013850 0.273412 0.062295 0.004771 0.062424 0.002550 0.059268 0.054110 0.03476 0.099144 0.016327 0.089798 0.024387 0.022518 0.018299 0.083849 0.110757 0.204685 0.916002 0.794056 1.869295
min 100002.000000 0.000000 0.000000 2.565000e+04 4.500000e+04 1615.500000 4.050000e+04 0.000290 -25229.000000 -17912.000000 -24672.000000 -7197.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.014568 8.173617e-08 0.000527 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -4292.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 189145.500000 0.000000 0.000000 1.125000e+05 2.700000e+05 16524.000000 2.385000e+05 0.010006 -19682.000000 -2760.000000 -7479.500000 -4299.000000 5.000000 1.000000 1.000000 0.000000 1.000000 0.000000 0.000000 2.000000 2.000000 2.000000 10.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.334007 3.924574e-01 0.370650 0.05770 0.044200 0.976700 0.687200 0.007800 0.000000 0.069000 0.166700 0.083300 0.018700 0.050400 0.045300 0.000000 0.000000 0.052500 0.040700 0.976700 0.699400 0.007200 0.000000 0.069000 0.166700 0.083300 0.016600 0.054200 0.042700 0.000000 0.000000 0.058300 0.043700 0.976700 0.691400 0.007900 0.000000 0.069000 0.166700 0.083300 0.018700 0.051300 0.045700 0.000000 0.000000 0.041200 0.000000 0.000000 0.000000 0.000000 -1570.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 278202.000000 0.000000 0.000000 1.471500e+05 5.135310e+05 24903.000000 4.500000e+05 0.018850 -15750.000000 -1213.000000 -4504.000000 -3254.000000 9.000000 1.000000 1.000000 0.000000 1.000000 0.000000 0.000000 2.000000 2.000000 2.000000 12.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.505998 5.659614e-01 0.535276 0.08760 0.076300 0.981600 0.755200 0.021100 0.000000 0.137900 0.166700 0.208300 0.048100 0.075600 0.074500 0.000000 0.003600 0.084000 0.074600 0.981600 0.764800 0.019000 0.000000 0.137900 0.166700 0.208300 0.045800 0.077100 0.073100 0.000000 0.001100 0.086400 0.075800 0.981600 0.758500 0.020800 0.000000 0.137900 0.166700 0.208300 0.048700 0.076100 0.074900 0.000000 0.003100 0.068800 0.000000 0.000000 0.000000 0.000000 -757.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000
75% 367142.500000 0.000000 1.000000 2.025000e+05 8.086500e+05 34596.000000 6.795000e+05 0.028663 -12413.000000 -289.000000 -2010.000000 -1720.000000 15.000000 1.000000 1.000000 0.000000 1.000000 1.000000 0.000000 3.000000 2.000000 2.000000 14.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.675053 6.636171e-01 0.669057 0.14850 0.112200 0.986600 0.823200 0.051500 0.120000 0.206900 0.333300 0.375000 0.085600 0.121000 0.129900 0.003900 0.027700 0.143900 0.112400 0.986600 0.823600 0.049000 0.120800 0.206900 0.333300 0.375000 0.084100 0.131300 0.125200 0.003900 0.023100 0.148900 0.111600 0.986600 0.825600 0.051300 0.120000 0.206900 0.333300 0.375000 0.086800 0.123100 0.130300 0.003900 0.026600 0.127600 2.000000 0.000000 2.000000 0.000000 -274.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.000000
max 456255.000000 1.000000 19.000000 1.170000e+08 4.050000e+06 258025.500000 4.050000e+06 0.072508 -7489.000000 365243.000000 0.000000 0.000000 91.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 20.000000 3.000000 3.000000 23.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.962693 8.549997e-01 0.896010 1.00000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 348.000000 34.000000 344.000000 24.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 4.000000 9.000000 8.000000 27.000000 261.000000 25.000000
In [ ]:
datasets["application_test"].describe() #numerical only features
Out[ ]:
SK_ID_CURR CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 48744.000000 48744.000000 4.874400e+04 4.874400e+04 48720.000000 4.874400e+04 48744.000000 48744.000000 48744.000000 48744.000000 48744.000000 16432.000000 48744.000000 48744.000000 48744.000000 48744.000000 48744.000000 48744.000000 48744.000000 48744.000000 48744.000000 48744.000000 48744.000000 48744.000000 48744.000000 48744.000000 48744.000000 48744.000000 28212.000000 48736.000000 40076.000000 24857.000000 21103.000000 25888.000000 16926.000000 15249.000000 23555.000000 25165.000000 25423.000000 16278.000000 20490.000000 15964.000000 25192.000000 15397.000000 22660.000000 24857.000000 21103.000000 25888.000000 16926.000000 15249.000000 23555.000000 25165.000000 25423.000000 16278.000000 20490.000000 15964.000000 25192.000000 15397.000000 22660.000000 24857.000000 21103.000000 25888.000000 16926.000000 15249.000000 23555.000000 25165.000000 25423.000000 16278.000000 20490.000000 15964.000000 25192.000000 15397.000000 22660.000000 26120.000000 48715.000000 48715.000000 48715.000000 48715.000000 48744.000000 48744.0 48744.000000 48744.000000 48744.000000 48744.000000 48744.000000 48744.000000 48744.000000 48744.0 48744.000000 48744.0 48744.0 48744.0 48744.0 48744.0 48744.0 48744.000000 48744.0 48744.0 48744.0 42695.000000 42695.000000 42695.000000 42695.000000 42695.000000 42695.000000
mean 277796.676350 0.397054 1.784318e+05 5.167404e+05 29426.240209 4.626188e+05 0.021226 -16068.084605 67485.366322 -4967.652716 -3051.712949 11.786027 0.999979 0.809720 0.204702 0.998400 0.263130 0.162646 2.146767 2.038159 2.012596 12.007365 0.018833 0.055166 0.042036 0.077466 0.224664 0.174216 0.501180 0.518021 0.500106 0.122388 0.090065 0.978828 0.751137 0.047624 0.085168 0.151777 0.233706 0.238423 0.067192 0.105885 0.112286 0.009231 0.029387 0.119078 0.088998 0.978292 0.758327 0.045223 0.080570 0.147161 0.229390 0.233854 0.065914 0.110874 0.110687 0.008358 0.028161 0.122809 0.089529 0.978822 0.754344 0.047420 0.084128 0.151200 0.233154 0.237846 0.068069 0.107063 0.113368 0.008979 0.029296 0.107129 1.447644 0.143652 1.435738 0.101139 -1077.766228 0.0 0.786620 0.000103 0.014751 0.087477 0.000041 0.088462 0.004493 0.0 0.001169 0.0 0.0 0.0 0.0 0.0 0.0 0.001559 0.0 0.0 0.0 0.002108 0.001803 0.002787 0.009299 0.546902 1.983769
std 103169.547296 0.709047 1.015226e+05 3.653970e+05 16016.368315 3.367102e+05 0.014428 4325.900393 144348.507136 3552.612035 1569.276709 11.462889 0.004529 0.392526 0.403488 0.039971 0.440337 0.369046 0.890423 0.522694 0.515804 3.278172 0.135937 0.228306 0.200673 0.267332 0.417365 0.379299 0.205142 0.181278 0.189498 0.113112 0.081536 0.049318 0.113188 0.082868 0.139164 0.100669 0.147361 0.164976 0.081909 0.098284 0.114860 0.048749 0.072007 0.113465 0.082655 0.053782 0.110117 0.081169 0.137509 0.101748 0.146485 0.165034 0.082880 0.103980 0.116699 0.046657 0.073504 0.114184 0.081022 0.049663 0.111998 0.082892 0.139014 0.100931 0.147629 0.165241 0.082869 0.099737 0.116503 0.048148 0.072998 0.111420 3.608053 0.514413 3.580125 0.403791 878.920740 0.0 0.409698 0.010128 0.120554 0.282536 0.006405 0.283969 0.066879 0.0 0.034176 0.0 0.0 0.0 0.0 0.0 0.0 0.039456 0.0 0.0 0.0 0.046373 0.046132 0.054037 0.110924 0.693305 1.838873
min 100001.000000 0.000000 2.694150e+04 4.500000e+04 2295.000000 4.500000e+04 0.000253 -25195.000000 -17463.000000 -23722.000000 -6348.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 -1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.013458 0.000008 0.000527 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -4361.000000 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 188557.750000 0.000000 1.125000e+05 2.606400e+05 17973.000000 2.250000e+05 0.010006 -19637.000000 -2910.000000 -7459.250000 -4448.000000 4.000000 1.000000 1.000000 0.000000 1.000000 0.000000 0.000000 2.000000 2.000000 2.000000 10.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.343695 0.408066 0.363945 0.061900 0.046700 0.976700 0.687200 0.008100 0.000000 0.074500 0.166700 0.104200 0.019000 0.050400 0.048575 0.000000 0.000000 0.058800 0.042500 0.976200 0.692900 0.007600 0.000000 0.069000 0.166700 0.083300 0.016525 0.055100 0.045600 0.000000 0.000000 0.062500 0.046150 0.976700 0.691400 0.008000 0.000000 0.069000 0.166700 0.083300 0.019000 0.051300 0.049000 0.000000 0.000000 0.043200 0.000000 0.000000 0.000000 0.000000 -1766.250000 0.0 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 277549.000000 0.000000 1.575000e+05 4.500000e+05 26199.000000 3.960000e+05 0.018850 -15785.000000 -1293.000000 -4490.000000 -3234.000000 9.000000 1.000000 1.000000 0.000000 1.000000 0.000000 0.000000 2.000000 2.000000 2.000000 12.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.506771 0.558758 0.519097 0.092800 0.078100 0.981600 0.755200 0.022700 0.000000 0.137900 0.166700 0.208300 0.048300 0.075600 0.077000 0.000000 0.003800 0.085100 0.077000 0.981600 0.758300 0.020300 0.000000 0.137900 0.166700 0.208300 0.046200 0.081700 0.075100 0.000000 0.001200 0.092600 0.077800 0.981600 0.758500 0.022300 0.000000 0.137900 0.166700 0.208300 0.048800 0.077000 0.077600 0.000000 0.003100 0.070700 0.000000 0.000000 0.000000 0.000000 -863.000000 0.0 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 2.000000
75% 367555.500000 1.000000 2.250000e+05 6.750000e+05 37390.500000 6.300000e+05 0.028663 -12496.000000 -296.000000 -1901.000000 -1706.000000 15.000000 1.000000 1.000000 0.000000 1.000000 1.000000 0.000000 3.000000 2.000000 2.000000 14.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.665956 0.658497 0.652897 0.148500 0.113400 0.986600 0.816400 0.053900 0.160000 0.206900 0.333300 0.375000 0.086800 0.126900 0.137600 0.005100 0.029000 0.150200 0.113550 0.986600 0.823600 0.051700 0.120800 0.206900 0.333300 0.375000 0.085600 0.132200 0.130600 0.003900 0.024500 0.149900 0.113000 0.986600 0.818900 0.053800 0.160000 0.206900 0.333300 0.375000 0.088000 0.126600 0.137425 0.003900 0.028025 0.135700 2.000000 0.000000 2.000000 0.000000 -363.000000 0.0 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.000000 1.000000 3.000000
max 456250.000000 20.000000 4.410000e+06 2.245500e+06 180576.000000 2.245500e+06 0.072508 -7338.000000 365243.000000 0.000000 0.000000 74.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 21.000000 3.000000 3.000000 23.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.939145 0.855000 0.882530 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 354.000000 34.000000 351.000000 24.000000 0.000000 0.0 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.0 1.000000 0.0 0.0 0.0 0.0 0.0 0.0 1.000000 0.0 0.0 0.0 2.000000 2.000000 2.000000 6.000000 7.000000 17.000000
In [ ]:
datasets["application_train"].describe(include='all') #look at all categorical and numerical
Out[ ]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 307511.000000 307511.000000 307511 307511 307511 307511 307511.000000 3.075110e+05 3.075110e+05 307499.000000 3.072330e+05 306219 307511 307511 307511 307511 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 104582.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 211120 307509.000000 307511.000000 307511.000000 307511 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511 134133.000000 3.068510e+05 246546.000000 151450.00000 127568.000000 157504.000000 103023.000000 92646.000000 143620.000000 152683.000000 154491.000000 98869.000000 124921.000000 97312.000000 153161.000000 93997.000000 137829.000000 151450.000000 127568.000000 157504.000000 103023.000000 92646.000000 143620.000000 152683.000000 154491.000000 98869.000000 124921.000000 97312.000000 153161.000000 93997.000000 137829.000000 151450.000000 127568.000000 157504.000000 103023.000000 92646.000000 143620.000000 152683.000000 154491.000000 98869.000000 124921.000000 97312.000000 153161.000000 93997.000000 137829.000000 97216 153214 159080.000000 151170 161756 306490.000000 306490.000000 306490.000000 306490.000000 307510.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.00000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 307511.000000 265992.000000 265992.000000 265992.000000 265992.000000 265992.000000 265992.000000
unique NaN NaN 2 3 2 2 NaN NaN NaN NaN NaN 7 8 5 6 6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 18 NaN NaN NaN 7 NaN NaN NaN NaN NaN NaN NaN 58 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 4 3 NaN 7 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
top NaN NaN Cash loans F N Y NaN NaN NaN NaN NaN Unaccompanied Working Secondary / secondary special Married House / apartment NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Laborers NaN NaN NaN TUESDAY NaN NaN NaN NaN NaN NaN NaN Business Entity Type 3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN reg oper account block of flats NaN Panel No NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
freq NaN NaN 278232 202448 202924 213312 NaN NaN NaN NaN NaN 248526 158774 218391 196432 272868 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 55186 NaN NaN NaN 53901 NaN NaN NaN NaN NaN NaN NaN 67992 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 73830 150503 NaN 66040 159428 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
mean 278180.518577 0.080729 NaN NaN NaN NaN 0.417052 1.687979e+05 5.990260e+05 27108.573909 5.383962e+05 NaN NaN NaN NaN NaN 0.020868 -16036.995067 63815.045904 -4986.120328 -2994.202373 12.061091 0.999997 0.819889 0.199368 0.998133 0.281066 0.056720 NaN 2.152665 2.052463 2.031521 NaN 12.063419 0.015144 0.050769 0.040659 0.078173 0.230454 0.179555 NaN 0.502130 5.143927e-01 0.510853 0.11744 0.088442 0.977735 0.752471 0.044621 0.078942 0.149725 0.226282 0.231894 0.066333 0.100775 0.107399 0.008809 0.028358 0.114231 0.087543 0.977065 0.759637 0.042553 0.074490 0.145193 0.222315 0.228058 0.064958 0.105645 0.105975 0.008076 0.027022 0.117850 0.087955 0.977752 0.755746 0.044595 0.078078 0.149213 0.225897 0.231625 0.067169 0.101954 0.108607 0.008651 0.028236 NaN NaN 0.102547 NaN NaN 1.422245 0.143421 1.405292 0.100049 -962.858788 0.000042 0.710023 0.000081 0.015115 0.088055 0.000192 0.081376 0.003896 0.000023 0.003912 0.000007 0.003525 0.002936 0.00121 0.009928 0.000267 0.008130 0.000595 0.000507 0.000335 0.006402 0.007000 0.034362 0.267395 0.265474 1.899974
std 102790.175348 0.272419 NaN NaN NaN NaN 0.722121 2.371231e+05 4.024908e+05 14493.737315 3.694465e+05 NaN NaN NaN NaN NaN 0.013831 4363.988632 141275.766519 3522.886321 1509.450419 11.944812 0.001803 0.384280 0.399526 0.043164 0.449521 0.231307 NaN 0.910682 0.509034 0.502737 NaN 3.265832 0.122126 0.219526 0.197499 0.268444 0.421124 0.383817 NaN 0.211062 1.910602e-01 0.194844 0.10824 0.082438 0.059223 0.113280 0.076036 0.134576 0.100049 0.144641 0.161380 0.081184 0.092576 0.110565 0.047732 0.069523 0.107936 0.084307 0.064575 0.110111 0.074445 0.132256 0.100977 0.143709 0.161160 0.081750 0.097880 0.111845 0.046276 0.070254 0.109076 0.082179 0.059897 0.112066 0.076144 0.134467 0.100368 0.145067 0.161934 0.082167 0.093642 0.112260 0.047415 0.070166 NaN NaN 0.107462 NaN NaN 2.400989 0.446698 2.379803 0.362291 826.808487 0.006502 0.453752 0.009016 0.122010 0.283376 0.013850 0.273412 0.062295 0.004771 0.062424 0.002550 0.059268 0.054110 0.03476 0.099144 0.016327 0.089798 0.024387 0.022518 0.018299 0.083849 0.110757 0.204685 0.916002 0.794056 1.869295
min 100002.000000 0.000000 NaN NaN NaN NaN 0.000000 2.565000e+04 4.500000e+04 1615.500000 4.050000e+04 NaN NaN NaN NaN NaN 0.000290 -25229.000000 -17912.000000 -24672.000000 -7197.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 NaN 1.000000 1.000000 1.000000 NaN 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 NaN 0.014568 8.173617e-08 0.000527 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 NaN NaN 0.000000 NaN NaN 0.000000 0.000000 0.000000 0.000000 -4292.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 189145.500000 0.000000 NaN NaN NaN NaN 0.000000 1.125000e+05 2.700000e+05 16524.000000 2.385000e+05 NaN NaN NaN NaN NaN 0.010006 -19682.000000 -2760.000000 -7479.500000 -4299.000000 5.000000 1.000000 1.000000 0.000000 1.000000 0.000000 0.000000 NaN 2.000000 2.000000 2.000000 NaN 10.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 NaN 0.334007 3.924574e-01 0.370650 0.05770 0.044200 0.976700 0.687200 0.007800 0.000000 0.069000 0.166700 0.083300 0.018700 0.050400 0.045300 0.000000 0.000000 0.052500 0.040700 0.976700 0.699400 0.007200 0.000000 0.069000 0.166700 0.083300 0.016600 0.054200 0.042700 0.000000 0.000000 0.058300 0.043700 0.976700 0.691400 0.007900 0.000000 0.069000 0.166700 0.083300 0.018700 0.051300 0.045700 0.000000 0.000000 NaN NaN 0.041200 NaN NaN 0.000000 0.000000 0.000000 0.000000 -1570.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 278202.000000 0.000000 NaN NaN NaN NaN 0.000000 1.471500e+05 5.135310e+05 24903.000000 4.500000e+05 NaN NaN NaN NaN NaN 0.018850 -15750.000000 -1213.000000 -4504.000000 -3254.000000 9.000000 1.000000 1.000000 0.000000 1.000000 0.000000 0.000000 NaN 2.000000 2.000000 2.000000 NaN 12.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 NaN 0.505998 5.659614e-01 0.535276 0.08760 0.076300 0.981600 0.755200 0.021100 0.000000 0.137900 0.166700 0.208300 0.048100 0.075600 0.074500 0.000000 0.003600 0.084000 0.074600 0.981600 0.764800 0.019000 0.000000 0.137900 0.166700 0.208300 0.045800 0.077100 0.073100 0.000000 0.001100 0.086400 0.075800 0.981600 0.758500 0.020800 0.000000 0.137900 0.166700 0.208300 0.048700 0.076100 0.074900 0.000000 0.003100 NaN NaN 0.068800 NaN NaN 0.000000 0.000000 0.000000 0.000000 -757.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000
75% 367142.500000 0.000000 NaN NaN NaN NaN 1.000000 2.025000e+05 8.086500e+05 34596.000000 6.795000e+05 NaN NaN NaN NaN NaN 0.028663 -12413.000000 -289.000000 -2010.000000 -1720.000000 15.000000 1.000000 1.000000 0.000000 1.000000 1.000000 0.000000 NaN 3.000000 2.000000 2.000000 NaN 14.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 NaN 0.675053 6.636171e-01 0.669057 0.14850 0.112200 0.986600 0.823200 0.051500 0.120000 0.206900 0.333300 0.375000 0.085600 0.121000 0.129900 0.003900 0.027700 0.143900 0.112400 0.986600 0.823600 0.049000 0.120800 0.206900 0.333300 0.375000 0.084100 0.131300 0.125200 0.003900 0.023100 0.148900 0.111600 0.986600 0.825600 0.051300 0.120000 0.206900 0.333300 0.375000 0.086800 0.123100 0.130300 0.003900 0.026600 NaN NaN 0.127600 NaN NaN 2.000000 0.000000 2.000000 0.000000 -274.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.000000
max 456255.000000 1.000000 NaN NaN NaN NaN 19.000000 1.170000e+08 4.050000e+06 258025.500000 4.050000e+06 NaN NaN NaN NaN NaN 0.072508 -7489.000000 365243.000000 0.000000 0.000000 91.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 NaN 20.000000 3.000000 3.000000 NaN 23.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 NaN 0.962693 8.549997e-01 0.896010 1.00000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 NaN NaN 1.000000 NaN NaN 348.000000 34.000000 344.000000 24.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 4.000000 9.000000 8.000000 27.000000 261.000000 25.000000

Missing data for application train¶

In [ ]:
percent = (datasets["application_train"].isnull().sum()/datasets["application_train"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["application_train"].isna().sum().sort_values(ascending = False)
missing_application_train_data  = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Train Missing Count"])
missing_application_train_data.head(20)
Out[ ]:
Percent Train Missing Count
COMMONAREA_MEDI 69.87 214865
COMMONAREA_AVG 69.87 214865
COMMONAREA_MODE 69.87 214865
NONLIVINGAPARTMENTS_MODE 69.43 213514
NONLIVINGAPARTMENTS_AVG 69.43 213514
NONLIVINGAPARTMENTS_MEDI 69.43 213514
FONDKAPREMONT_MODE 68.39 210295
LIVINGAPARTMENTS_MODE 68.35 210199
LIVINGAPARTMENTS_AVG 68.35 210199
LIVINGAPARTMENTS_MEDI 68.35 210199
FLOORSMIN_AVG 67.85 208642
FLOORSMIN_MODE 67.85 208642
FLOORSMIN_MEDI 67.85 208642
YEARS_BUILD_MEDI 66.50 204488
YEARS_BUILD_MODE 66.50 204488
YEARS_BUILD_AVG 66.50 204488
OWN_CAR_AGE 65.99 202929
LANDAREA_MEDI 59.38 182590
LANDAREA_MODE 59.38 182590
LANDAREA_AVG 59.38 182590
In [ ]:
missinga5 = missing_application_train_data.loc[missing_application_train_data['Percent'] > 55]
missinga5
Out[ ]:
Percent Train Missing Count
COMMONAREA_MEDI 69.87 214865
COMMONAREA_AVG 69.87 214865
COMMONAREA_MODE 69.87 214865
NONLIVINGAPARTMENTS_MODE 69.43 213514
NONLIVINGAPARTMENTS_AVG 69.43 213514
NONLIVINGAPARTMENTS_MEDI 69.43 213514
FONDKAPREMONT_MODE 68.39 210295
LIVINGAPARTMENTS_MODE 68.35 210199
LIVINGAPARTMENTS_AVG 68.35 210199
LIVINGAPARTMENTS_MEDI 68.35 210199
FLOORSMIN_AVG 67.85 208642
FLOORSMIN_MODE 67.85 208642
FLOORSMIN_MEDI 67.85 208642
YEARS_BUILD_MEDI 66.50 204488
YEARS_BUILD_MODE 66.50 204488
YEARS_BUILD_AVG 66.50 204488
OWN_CAR_AGE 65.99 202929
LANDAREA_MEDI 59.38 182590
LANDAREA_MODE 59.38 182590
LANDAREA_AVG 59.38 182590
BASEMENTAREA_MEDI 58.52 179943
BASEMENTAREA_AVG 58.52 179943
BASEMENTAREA_MODE 58.52 179943
EXT_SOURCE_1 56.38 173378
NONLIVINGAREA_MODE 55.18 169682
NONLIVINGAREA_AVG 55.18 169682
NONLIVINGAREA_MEDI 55.18 169682

In all of the above columns with more than 55% data missing the below three are the least correlated with the TARGET value. Hence we are dropping them from the table entirely.

In [ ]:
drop = datasets["application_train"]['NONLIVINGAPARTMENTS_MODE'].corr(datasets["application_train"]['TARGET'])
drop
Out[ ]:
-0.0015565608459037438
In [ ]:
drop = datasets["application_train"]['NONLIVINGAPARTMENTS_AVG'].corr(datasets["application_train"]['TARGET'])
drop
Out[ ]:
-0.0031761075207261394
In [ ]:
drop = datasets["application_train"]['NONLIVINGAPARTMENTS_MEDI'].corr(datasets["application_train"]['TARGET'])
drop
Out[ ]:
-0.0027571486021083823
In [ ]:
import seaborn as sns
import matplotlib.pyplot as plt

# Filter rows with count of missing values greater than 0
missing_application_train_data = missing_application_train_data[missing_application_train_data['Train Missing Count'] > 0]
print("Number of columns with missing value greater than 0:",len(missing_application_train_data))
missing_more_than_60=missing_application_train_data[missing_application_train_data['Percent'] >=60]
print("Number of columns with more than 60% missing values:", len(missing_more_than_60))

plt.figure(figsize=(15, 25))

sns.set_style("whitegrid")
ax = sns.barplot(x="Train Missing Count", y=missing_application_train_data.index, data=missing_application_train_data,
                 palette="Blues_d")

# Set the figure size and axis labels

ax.set_xlabel("Number of Missing Values")
ax.set_ylabel("Features")

plt.title('Missing values in Train columns')

# Add values to the bars
for i, v in enumerate(missing_application_train_data['Train Missing Count']):
    ax.text(v + 10, i + .25, str(v), color='black', fontweight='bold')

# Show the plot
plt.show()
Number of columns with missing value greater than 0: 67
Number of columns with more than 60% missing values: 17
In [ ]:
percent = (datasets["application_test"].isnull().sum()/datasets["application_test"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["application_test"].isna().sum().sort_values(ascending = False)
missing_application_test_data  = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Test Missing Count"])
missing_application_test_data.head(20)
Out[ ]:
Percent Test Missing Count
COMMONAREA_AVG 68.72 33495
COMMONAREA_MODE 68.72 33495
COMMONAREA_MEDI 68.72 33495
NONLIVINGAPARTMENTS_AVG 68.41 33347
NONLIVINGAPARTMENTS_MODE 68.41 33347
NONLIVINGAPARTMENTS_MEDI 68.41 33347
FONDKAPREMONT_MODE 67.28 32797
LIVINGAPARTMENTS_AVG 67.25 32780
LIVINGAPARTMENTS_MODE 67.25 32780
LIVINGAPARTMENTS_MEDI 67.25 32780
FLOORSMIN_MEDI 66.61 32466
FLOORSMIN_AVG 66.61 32466
FLOORSMIN_MODE 66.61 32466
OWN_CAR_AGE 66.29 32312
YEARS_BUILD_AVG 65.28 31818
YEARS_BUILD_MEDI 65.28 31818
YEARS_BUILD_MODE 65.28 31818
LANDAREA_MEDI 57.96 28254
LANDAREA_AVG 57.96 28254
LANDAREA_MODE 57.96 28254
In [ ]:
# Filter rows with count of missing values greater than 0
missing_application_test_data = missing_application_test_data[missing_application_test_data['Test Missing Count'] > 0]
print("Number of columns with missing value greater than 0:",len(missing_application_test_data))
missing_more_than_60=missing_application_test_data[missing_application_test_data['Percent'] >=60]
print("Number of columns with more than 60% missing values:", len(missing_more_than_60))

plt.figure(figsize=(15, 25))

sns.set_style("whitegrid")
ax = sns.barplot(x="Test Missing Count", y=missing_application_test_data.index, data=missing_application_test_data,
                 palette="rocket")

# Set the figure size and axis labels

ax.set_xlabel("Number of Missing Values ")
ax.set_ylabel("Features")

plt.title('Missing values in Test columns')

# Add values to the bars
for i, v in enumerate(missing_application_test_data['Test Missing Count']):
    ax.text(v + 10, i + .25, str(v), color='black', fontweight='bold')

# Show the plot
plt.show()
Number of columns with missing value greater than 0: 64
Number of columns with more than 60% missing values: 17

Observation¶

  • 17 columns have more than 60% of the data missing in both train and test file.
  • 67 columns in application train have missing vales whereas 64 columns in application test have missing values
  • It's better to check the correlation of these columns with the target variable and drop the columns with very low correlation with the target variable instead of filling the missing values in these columns using an imputer as the mean of the column will be ambiguous due to a lot of null values but we can experiment with all the columns right now and later we can remove them if the accuracy is coming low.
In [ ]:
datasets["application_train"] = datasets["application_train"].drop(['NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAPARTMENTS_MEDI'], axis=1)
datasets["application_train"].shape
Out[ ]:
(307511, 119)
In [ ]:
datasets["application_test"] = datasets["application_test"].drop(['NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAPARTMENTS_MEDI'], axis=1)
datasets["application_test"].shape
Out[ ]:
(48744, 118)

As we can see above in the shape that both in the train and test datasets we have dropped the three columns that had a very low correlation to the target of the dataset.

Distribution of the target column¶

In [ ]:
# Create a bar plot of the distribution of the target column
sns.set(style='darkgrid')
ax = sns.countplot(x='TARGET', data=datasets['application_train'])

# Set the plot title and axis labels
plt.title('Distribution of Target Column')
plt.xlabel('TARGET')
plt.ylabel('Count')

# Set the tick labels for the x-axis
plt.xticks([0, 1], ['Loan Repaid', 'Loan Not-Repaid'])

# Add percentage values on top of the bars
total = float(len(datasets['application_train']))
for p in ax.patches:
    percentage = '{:.1f}%'.format(100 * p.get_height() / total)
    x = p.get_x() + p.get_width() / 2
    y = p.get_height()
    ax.annotate(percentage, (x, y), ha='center', va='center', fontsize=12, color='black', xytext=(0, 5),
                textcoords='offset points')

# Show the plot
plt.show()

Observation¶

  • There's huge a class imbalance between target 0 and 1, we need to resolve this issue later before passing the input data to the model
  • 91.9% of the loans get repaid whereas 8.1% of loans is still not repaid

Correlation with the target column¶

In [ ]:
correlations = datasets["application_train"].corr()['TARGET'].sort_values()
print('Most Positive Correlations:\n', correlations.tail(10))
print('\nMost Negative Correlations:\n', correlations.head(10))
Most Positive Correlations:
 FLAG_DOCUMENT_3                0.044346
REG_CITY_NOT_LIVE_CITY         0.044395
FLAG_EMP_PHONE                 0.045982
REG_CITY_NOT_WORK_CITY         0.050994
DAYS_ID_PUBLISH                0.051457
DAYS_LAST_PHONE_CHANGE         0.055218
REGION_RATING_CLIENT           0.058899
REGION_RATING_CLIENT_W_CITY    0.060893
DAYS_BIRTH                     0.078239
TARGET                         1.000000
Name: TARGET, dtype: float64

Most Negative Correlations:
 EXT_SOURCE_3                 -0.178919
EXT_SOURCE_2                 -0.160472
EXT_SOURCE_1                 -0.155317
DAYS_EMPLOYED                -0.044932
FLOORSMAX_AVG                -0.044003
FLOORSMAX_MEDI               -0.043768
FLOORSMAX_MODE               -0.043226
AMT_GOODS_PRICE              -0.039645
REGION_POPULATION_RELATIVE   -0.037227
ELEVATORS_AVG                -0.034199
Name: TARGET, dtype: float64
In [ ]:
correlations = datasets["application_train"].corr()
sns.set_theme(style="white")
sns.heatmap(correlations)
Out[ ]:
<Axes: >

Observation¶

  • Check which columns has more than 90% correlation, we can keep one
  • But this particular correlation map is too huge to make any sensible observations from it so we will dig deeper into multiple individual columns

Loan Repaid Analysis Based on Gender¶

In [ ]:
# First pie chart - loan data by gender
fig, ax = plt.subplots(figsize=(6, 6))

gender_counts = datasets["application_train"]['CODE_GENDER'].value_counts()
gender_labels = gender_counts.index
gender_sizes = gender_counts.values
colors = ["#497AA7", "#73A4CA"]
ax.pie(gender_sizes, labels=gender_labels, autopct='%1.1f%%', startangle=90, textprops={'fontsize': 12}, colors=colors)
# Add a common title for the figure
ax.set_title('Number of Loans taken based on gender ', fontsize=16)

# Set the aspect ratio to "equal" and the x and y limits to the same value to make it a perfect circle
ax.set_aspect('equal')
ax.set_xlim([-1.1, 1.1])
ax.set_ylim([-1.1, 1.1])

# Show the plot
plt.show()
In [ ]:
fig,(ax2,ax3)=plt.subplots(1,2, figsize=(10, 5))
colors = ['#FEB5D4', "#E7C5FF"]
# Second pie chart - loan repaid by gender
gender_loan_repaid = datasets["application_train"][datasets["application_train"]['TARGET'] == 0]
gender_loan_repaid_counts = gender_loan_repaid['CODE_GENDER'].value_counts()
gender_loan_repaid_labels = gender_loan_repaid_counts.index
gender_loan_repaid_sizes = gender_loan_repaid_counts.values
ax2.pie(gender_loan_repaid_sizes, labels=gender_loan_repaid_labels, autopct='%1.1f%%', startangle=90, textprops={'fontsize': 12}, colors=colors)
ax2.set_title('Loan Repaid by Gender')

# Third pie chart - loan not repaid by gender
gender_loan_not_repaid = datasets["application_train"][datasets["application_train"]['TARGET'] == 1]
gender_loan_not_repaid_counts = gender_loan_not_repaid['CODE_GENDER'].value_counts()
gender_loan_not_repaid_labels = gender_loan_not_repaid_counts.index
gender_loan_not_repaid_sizes = gender_loan_not_repaid_counts.values
ax3.pie(gender_loan_not_repaid_sizes, labels=gender_loan_not_repaid_labels, autopct='%1.1f%%', startangle=90, textprops={'fontsize': 12}, colors=colors)
ax3.set_title('Loan Not Repaid by Gender')



# Adjust the layout and spacing of the subplots
plt.tight_layout(pad=2)

# Show the plot
plt.show()

Observation¶

  • We can observe from the above pie charts that the number of loans taken out by women are almost twice as that of men
  • The chances of loan repaid by females is also twice the amount than that for male
  • The chances of loan not getting repaid is 10% more in female than male

Applicants age and whether they repaid loan or not¶

In [ ]:
plt.hist(datasets["application_train"]['DAYS_BIRTH'] / -365, edgecolor = 'k', bins = 25, color = "#FFD66A")
plt.title('Age of Client'); plt.xlabel('Age (years)'); plt.ylabel('Count');
In [ ]:
plt.figure(figsize = (12, 6))
sns.kdeplot(datasets["application_train"].loc[datasets["application_train"]['TARGET']==0, 'DAYS_BIRTH'] / -365, color='purple')
sns.kdeplot(datasets["application_train"].loc[datasets["application_train"]['TARGET']==1, 'DAYS_BIRTH'] / -365, color='green')
plt.xlabel('Age of applicant')
plt.title('Repayment of loan based on age', fontsize=16)
plt.show()

Observation¶

  • In general most of the clients are in the age group of around 35-45 years.
  • Accoding to the above density graph we can see that many people around the age group of 30 years wont be able to repay the loan.
  • People around the age of 40 are on an equal scale in other words its a 50-50 chance that they will repay their loan.
  • People from the age of 50 normally should be able to pay back their loan.

Applicants Housing Situation¶

In [ ]:
fig, axs = plt.subplots(ncols=1, figsize=(12, 6))
# Create a countplot of NAME_HOUSING_TYPE vs TARGET
sns.countplot(x='NAME_HOUSING_TYPE', hue='TARGET', data=datasets['application_train'], palette = 'crest')
axs.set_title('Applicants Housing Situation')
axs.set_ylabel('Count')
axs.set_xlabel('Housing Type')
plt.show()

Observation¶

  • The people who have their own house or apartment mostly will be able to repay their loan.
  • We can also see that the number of people who do not have an own aparment or house and who seek for a loan are very less in number.

Loan repayment based on Car Ownership and Rental Property¶

In [ ]:
# Create a pair of subplots with a shared y-axis
fig, axs = plt.subplots(ncols=2, figsize=(14, 6))

# Create a countplot of FLAG_OWN_CAR vs TARGET
sns.countplot(x='FLAG_OWN_CAR', hue='TARGET', data=datasets['application_train'], ax=axs[0],palette='Set1')
axs[0].set_title('Loan Repaid Based on Car Ownership')

# Create a countplot of FLAG_OWN_REALTY vs TARGET
sns.countplot(x='FLAG_OWN_REALTY', hue='TARGET', data=datasets['application_train'], ax=axs[1],palette='Set1')
axs[1].set_title('Loan Repaid Based on Realty Ownership')

# Set the y-axis label
axs[0].set_ylabel('Count')
axs[1].set_ylabel('Count')

# Set the x-axis label
axs[0].set_xlabel('Car Ownership')
axs[1].set_xlabel('Realty Ownership')

# Show the plot
plt.show()

Observation¶

In the above two graphs we can see that the fact that people own a car or a realty or not does not make a difference on weather they pay their loan back or not.

Repayment of Loan Based on the Loan Type¶

In [ ]:
# First pie chart - loan data by type
fig, ax = plt.subplots(figsize=(6, 6))

loantype_counts = datasets["application_train"]['NAME_CONTRACT_TYPE'].value_counts()
loantype_labels = loantype_counts.index
loantype_sizes = loantype_counts.values
colors = ['#A7E8CB', '#C4DDFF']
ax.pie(loantype_sizes, labels=loantype_labels, autopct='%1.1f%%', startangle=90, textprops={'fontsize': 12}, colors=colors)
# Add a common title for the figure
ax.set_title('Number of Loans taken based on loan type', fontsize=16)

# Set the aspect ratio to "equal" and the x and y limits to the same value to make it a perfect circle
ax.set_aspect('equal')
ax.set_xlim([-1.1, 1.1])
ax.set_ylim([-1.1, 1.1])

# Show the plot
plt.show()

Observation¶

Almost 90% of the loans that are taken out by applicants are cash loans and the other left out percentage are revolving loans. This shows that a preference towards cash loan exists.

Background of the Applicant¶

Education¶

In [ ]:
fig, axs = plt.subplots(ncols=1, figsize=(12, 6))

# Create a countplot of NAME_EDUCATION_TYPE vs TARGET
sns.countplot(x='NAME_EDUCATION_TYPE', hue='TARGET', data=datasets['application_train'], palette = 'rocket')
axs.set_title('Education level of the Applicant')

# Set the y-axis label
axs.set_ylabel('Count')
axs.set_xlabel('Education Level')

# Show the plot
plt.show()

Observation¶

We can see that most of the applicants have done atleast secondary level of education and a good amount of them have done higher education as well.

Organisation¶

In [ ]:
fig, axs = plt.subplots(ncols=1, figsize=(20, 24))

# Create a countplot of ORGANIZATION_TYPE vs TARGET
sns.countplot(y='ORGANIZATION_TYPE', hue='TARGET', data=datasets['application_train'], palette = 'rocket')
axs.set_title('Organisation that the Applicant is working in')

# Set the x-axis label
axs.set_xlabel('Count')
axs.set_ylabel('Organisation')

# Show the plot
plt.show()

Observation¶

We can observe that most of the applicants are working in a business entity of some sort but we if we notice there is an anomaly here that the second maximum number of applicants are working under a type of null value and are not labelled appropriately so we need to take care of this in the next phase of the project.

Occupation¶

In [ ]:
sns.countplot(x='OCCUPATION_TYPE', data=datasets["application_train"]);
plt.title('Applicants Occupation');
plt.xticks(rotation=90);

Observation¶

Most people applying for the loan are working as laborers or sales staff members which seems accurate if u consider the real world scenario so we do not have to do any changes to this column.

Distribution Plots for AMT columns¶

AMT Credit¶

In [ ]:
plt.figure(figsize=(12, 6)) 
sns.boxplot(x=datasets['application_train']['AMT_CREDIT'], palette='Set2') 
plt.xlabel('Credit')
plt.title('Boxplot for Credit') 
plt.show()

Observation¶

Credit is a very important column for any applicant but we can observe many outliers in the graph but since the number of outliers looks so high we might loose valuable information if we do any outlier removal. Also the feature is right skewed.

AMT Annuity¶

In [ ]:
plt.figure(figsize=(12, 6)) 
sns.distplot(x=datasets['application_train']['AMT_ANNUITY'], color='purple') 
plt.xlabel('Annuity')
plt.title('Distribution plot for Annuity') 
plt.show()

AMT Goods Price¶

In [ ]:
plt.figure(figsize=(12, 6)) 
sns.boxplot(x=datasets['application_train']['AMT_GOODS_PRICE'], palette='Set2') 
plt.xlabel('Goods Price')
plt.title('Boxplot for Goods Price') 
plt.show()

Observation¶

The feature is right skewed. We can observe many outliers in this graph as well. Since the number of outliers looks so high we might loose valuable information if we do any outlier removal here as well.

Summary of Bureau¶

Information about the bureau dataset¶

In [ ]:
datasets["bureau"].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1716428 entries, 0 to 1716427
Data columns (total 17 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   SK_ID_CURR              int64  
 1   SK_ID_BUREAU            int64  
 2   CREDIT_ACTIVE           object 
 3   CREDIT_CURRENCY         object 
 4   DAYS_CREDIT             int64  
 5   CREDIT_DAY_OVERDUE      int64  
 6   DAYS_CREDIT_ENDDATE     float64
 7   DAYS_ENDDATE_FACT       float64
 8   AMT_CREDIT_MAX_OVERDUE  float64
 9   CNT_CREDIT_PROLONG      int64  
 10  AMT_CREDIT_SUM          float64
 11  AMT_CREDIT_SUM_DEBT     float64
 12  AMT_CREDIT_SUM_LIMIT    float64
 13  AMT_CREDIT_SUM_OVERDUE  float64
 14  CREDIT_TYPE             object 
 15  DAYS_CREDIT_UPDATE      int64  
 16  AMT_ANNUITY             float64
dtypes: float64(8), int64(6), object(3)
memory usage: 222.6+ MB

Statistical summary of the bureau dataset¶

In [ ]:
datasets["bureau"].describe(include='all')
Out[ ]:
SK_ID_CURR SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE CREDIT_TYPE DAYS_CREDIT_UPDATE AMT_ANNUITY
count 1.716428e+06 1.716428e+06 1716428 1716428 1.716428e+06 1.716428e+06 1.610875e+06 1.082775e+06 5.919400e+05 1.716428e+06 1.716415e+06 1.458759e+06 1.124648e+06 1.716428e+06 1716428 1.716428e+06 4.896370e+05
unique NaN NaN 4 4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 15 NaN NaN
top NaN NaN Closed currency 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Consumer credit NaN NaN
freq NaN NaN 1079273 1715020 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1251615 NaN NaN
mean 2.782149e+05 5.924434e+06 NaN NaN -1.142108e+03 8.181666e-01 5.105174e+02 -1.017437e+03 3.825418e+03 6.410406e-03 3.549946e+05 1.370851e+05 6.229515e+03 3.791276e+01 NaN -5.937483e+02 1.571276e+04
std 1.029386e+05 5.322657e+05 NaN NaN 7.951649e+02 3.654443e+01 4.994220e+03 7.140106e+02 2.060316e+05 9.622391e-02 1.149811e+06 6.774011e+05 4.503203e+04 5.937650e+03 NaN 7.207473e+02 3.258269e+05
min 1.000010e+05 5.000000e+06 NaN NaN -2.922000e+03 0.000000e+00 -4.206000e+04 -4.202300e+04 0.000000e+00 0.000000e+00 0.000000e+00 -4.705600e+06 -5.864061e+05 0.000000e+00 NaN -4.194700e+04 0.000000e+00
25% 1.888668e+05 5.463954e+06 NaN NaN -1.666000e+03 0.000000e+00 -1.138000e+03 -1.489000e+03 0.000000e+00 0.000000e+00 5.130000e+04 0.000000e+00 0.000000e+00 0.000000e+00 NaN -9.080000e+02 0.000000e+00
50% 2.780550e+05 5.926304e+06 NaN NaN -9.870000e+02 0.000000e+00 -3.300000e+02 -8.970000e+02 0.000000e+00 0.000000e+00 1.255185e+05 0.000000e+00 0.000000e+00 0.000000e+00 NaN -3.950000e+02 0.000000e+00
75% 3.674260e+05 6.385681e+06 NaN NaN -4.740000e+02 0.000000e+00 4.740000e+02 -4.250000e+02 0.000000e+00 0.000000e+00 3.150000e+05 4.015350e+04 0.000000e+00 0.000000e+00 NaN -3.300000e+01 1.350000e+04
max 4.562550e+05 6.843457e+06 NaN NaN 0.000000e+00 2.792000e+03 3.119900e+04 0.000000e+00 1.159872e+08 9.000000e+00 5.850000e+08 1.701000e+08 4.705600e+06 3.756681e+06 NaN 3.720000e+02 1.184534e+08

Percentage and count of missing values in each column of the bureau dataset¶

In [ ]:
percent = (datasets["bureau"].isnull().sum()/datasets["bureau"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["bureau"].isna().sum().sort_values(ascending = False)
total_count_with = datasets["bureau"].isnull().count()
total_count_without = datasets["bureau"].count()
missing_application_train_data  = pd.concat([percent, sum_missing, total_count_without, total_count_with], axis=1, keys=['Percent', 'Missing Count', 'Total Count without Missing value', 'Total Count'])
missing_application_train_data
Out[ ]:
Percent Missing Count Total Count without Missing value Total Count
AMT_ANNUITY 71.47 1226791 489637 1716428
AMT_CREDIT_MAX_OVERDUE 65.51 1124488 591940 1716428
DAYS_ENDDATE_FACT 36.92 633653 1082775 1716428
AMT_CREDIT_SUM_LIMIT 34.48 591780 1124648 1716428
AMT_CREDIT_SUM_DEBT 15.01 257669 1458759 1716428
DAYS_CREDIT_ENDDATE 6.15 105553 1610875 1716428
AMT_CREDIT_SUM 0.00 13 1716415 1716428
CREDIT_ACTIVE 0.00 0 1716428 1716428
CREDIT_CURRENCY 0.00 0 1716428 1716428
DAYS_CREDIT 0.00 0 1716428 1716428
CREDIT_DAY_OVERDUE 0.00 0 1716428 1716428
SK_ID_BUREAU 0.00 0 1716428 1716428
CNT_CREDIT_PROLONG 0.00 0 1716428 1716428
AMT_CREDIT_SUM_OVERDUE 0.00 0 1716428 1716428
CREDIT_TYPE 0.00 0 1716428 1716428
DAYS_CREDIT_UPDATE 0.00 0 1716428 1716428
SK_ID_CURR 0.00 0 1716428 1716428

Bar plot to visualize the percentage of missing data for each features of bureau dataset¶

In [ ]:
percent = (datasets['bureau'].isnull().sum()/datasets['bureau'].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets['bureau'].isna().sum().sort_values(ascending = False)
missing_data  = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Missing Count"])
missing_data=missing_data[missing_data['Percent'] > 0]
f, ax = plt.subplots()
plt.title(f'Missing data for bureau (%)', fontsize=11)
fig=sns.barplot(x = missing_data["Percent"],y = missing_data.index ,alpha=0.8)
plt.xlabel('Missing values (%)', fontsize=10)
plt.ylabel('Features', fontsize=10)
Out[ ]:
Text(0, 0.5, 'Features')

Merging bureau and application_train based on SK_ID_CURR¶

In [ ]:
bureau_df = datasets['bureau']
application_train_df = datasets['application_train']
merged_df = application_train_df[['SK_ID_CURR', 'TARGET']].merge(bureau_df, on='SK_ID_CURR', how='left')

merged_df.head(5)
Out[ ]:
SK_ID_CURR TARGET SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE CREDIT_TYPE DAYS_CREDIT_UPDATE AMT_ANNUITY
0 100002 1 6158904.0 Closed currency 1 -1125.0 0.0 -1038.0 -1038.0 NaN 0.0 40761.0 NaN NaN 0.0 Credit card -1038.0 0.0
1 100002 1 6158905.0 Closed currency 1 -476.0 0.0 NaN -48.0 NaN 0.0 0.0 0.0 NaN 0.0 Credit card -47.0 NaN
2 100002 1 6158906.0 Closed currency 1 -1437.0 0.0 -1072.0 -1185.0 0.000 0.0 135000.0 0.0 0.0 0.0 Consumer credit -1185.0 0.0
3 100002 1 6158907.0 Closed currency 1 -1121.0 0.0 -911.0 -911.0 3321.000 0.0 19071.0 NaN NaN 0.0 Consumer credit -906.0 0.0
4 100002 1 6158908.0 Closed currency 1 -645.0 0.0 85.0 -36.0 5043.645 0.0 120735.0 0.0 0.0 0.0 Consumer credit -34.0 0.0

Correlation matrix for merged dataframe¶

In [ ]:
fig = plt.figure(figsize=(15,10))
sns.heatmap(merged_df.corr(), annot=True)
Out[ ]:
<Axes: >

Observation:¶

  • The DAYS_CREDIT and DAYS_CREDIT_UPDATE columns from bureau_df have weak positive correlations with the TARGET column from application_train_df. This suggests that as the number of days since the last credit inquiry or the number of days until the end of a credit agreement increases, the probability of the loan not being repaid may increase.

Scatterplot for DAYS_CREDIT and DAYS_ENDDATE_FACT column with respect to the TARGET column¶

In [ ]:
# Removing the outlier which was affecting our plot
merged_df = merged_df[merged_df["SK_ID_CURR"] != 104463]  
plt.figure(figsize=(10, 8))
sns.scatterplot(x = "DAYS_CREDIT", y='DAYS_ENDDATE_FACT', hue='TARGET' ,data=merged_df)
plt.title('Days Since Credit Inquiry vs. Days Since Credit Ended W.R.T Loan Repayment Status')
plt.xlabel('Days Since Credit Inquiry')
plt.ylabel('Days Since Credit Ended')
plt.show()

Observation:¶

  • We can see the points clustering along a diagonal line running from bottom left to the top right of the plot.
  • We can see that as the number of days since the credit inquiry increases, so does the number of days since the credit line ended.
  • Since the plot also shows a large spread of points with varying repayment status, we can say that there is no clear correlation between these variables and the loan repayment behavior

Summary of Bureau Balance¶

Information about the bureau_balance dataset¶

In [ ]:
datasets["bureau_balance"].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27299925 entries, 0 to 27299924
Data columns (total 3 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   SK_ID_BUREAU    int64 
 1   MONTHS_BALANCE  int64 
 2   STATUS          object
dtypes: int64(2), object(1)
memory usage: 624.8+ MB

Statistical summary of the bureau_balance dataset¶

In [ ]:
datasets["bureau_balance"].describe(include='all')
Out[ ]:
SK_ID_BUREAU MONTHS_BALANCE STATUS
count 2.729992e+07 2.729992e+07 27299925
unique NaN NaN 8
top NaN NaN C
freq NaN NaN 13646993
mean 6.036297e+06 -3.074169e+01 NaN
std 4.923489e+05 2.386451e+01 NaN
min 5.001709e+06 -9.600000e+01 NaN
25% 5.730933e+06 -4.600000e+01 NaN
50% 6.070821e+06 -2.500000e+01 NaN
75% 6.431951e+06 -1.100000e+01 NaN
max 6.842888e+06 0.000000e+00 NaN

Percentage and count of missing values in each column of the bureau_balance dataset¶

In [ ]:
percent = (datasets["bureau_balance"].isnull().sum()/datasets["bureau_balance"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["bureau_balance"].isna().sum().sort_values(ascending = False)
total_count_with = datasets["bureau_balance"].isnull().count()
total_count_without = datasets["bureau_balance"].count()
missing_application_train_data  = pd.concat([percent, sum_missing, total_count_without, total_count_with], axis=1, keys=['Percent', 'Missing Count', 'Total Count without Missing value', 'Total Count'])
missing_application_train_data
Out[ ]:
Percent Missing Count Total Count without Missing value Total Count
SK_ID_BUREAU 0.0 0 27299925 27299925
MONTHS_BALANCE 0.0 0 27299925 27299925
STATUS 0.0 0 27299925 27299925

Merging bureau_balance and merged bureau/application_train based on SK_ID_BUREAU¶

In [ ]:
bureau_balance_df = datasets['bureau_balance']
bureau_balance_merged_df = merged_df[['SK_ID_BUREAU', 'TARGET']].merge(bureau_balance_df, on='SK_ID_BUREAU', how='left')

bureau_balance_merged_df.head(5)
Out[ ]:
SK_ID_BUREAU TARGET MONTHS_BALANCE STATUS
0 6158904.0 1 -15.0 C
1 6158904.0 1 -16.0 C
2 6158904.0 1 -17.0 0
3 6158904.0 1 -18.0 0
4 6158904.0 1 -19.0 0

Correlation matrix for bureau_balance_merged dataframe¶

In [ ]:
fig = plt.figure(figsize=(5,3))
sns.heatmap(bureau_balance_merged_df.corr(), annot=True)
Out[ ]:
<Axes: >

Observation:¶

  • The strongest positive correlation is between the MONTHS_BALANCE column and the TARGET column, which suggests that the target variable may be related to the duration of the credit history

Summary of Credit Card Balance¶

Information about the credit_card_balance dataset¶

In [ ]:
datasets["credit_card_balance"].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3840312 entries, 0 to 3840311
Data columns (total 23 columns):
 #   Column                      Dtype  
---  ------                      -----  
 0   SK_ID_PREV                  int64  
 1   SK_ID_CURR                  int64  
 2   MONTHS_BALANCE              int64  
 3   AMT_BALANCE                 float64
 4   AMT_CREDIT_LIMIT_ACTUAL     int64  
 5   AMT_DRAWINGS_ATM_CURRENT    float64
 6   AMT_DRAWINGS_CURRENT        float64
 7   AMT_DRAWINGS_OTHER_CURRENT  float64
 8   AMT_DRAWINGS_POS_CURRENT    float64
 9   AMT_INST_MIN_REGULARITY     float64
 10  AMT_PAYMENT_CURRENT         float64
 11  AMT_PAYMENT_TOTAL_CURRENT   float64
 12  AMT_RECEIVABLE_PRINCIPAL    float64
 13  AMT_RECIVABLE               float64
 14  AMT_TOTAL_RECEIVABLE        float64
 15  CNT_DRAWINGS_ATM_CURRENT    float64
 16  CNT_DRAWINGS_CURRENT        int64  
 17  CNT_DRAWINGS_OTHER_CURRENT  float64
 18  CNT_DRAWINGS_POS_CURRENT    float64
 19  CNT_INSTALMENT_MATURE_CUM   float64
 20  NAME_CONTRACT_STATUS        object 
 21  SK_DPD                      int64  
 22  SK_DPD_DEF                  int64  
dtypes: float64(15), int64(7), object(1)
memory usage: 673.9+ MB

Statistical summary of the credit_card_balance dataset¶

In [ ]:
datasets["credit_card_balance"].describe(include='all')
Out[ ]:
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE AMT_BALANCE AMT_CREDIT_LIMIT_ACTUAL AMT_DRAWINGS_ATM_CURRENT AMT_DRAWINGS_CURRENT AMT_DRAWINGS_OTHER_CURRENT AMT_DRAWINGS_POS_CURRENT AMT_INST_MIN_REGULARITY AMT_PAYMENT_CURRENT AMT_PAYMENT_TOTAL_CURRENT AMT_RECEIVABLE_PRINCIPAL AMT_RECIVABLE AMT_TOTAL_RECEIVABLE CNT_DRAWINGS_ATM_CURRENT CNT_DRAWINGS_CURRENT CNT_DRAWINGS_OTHER_CURRENT CNT_DRAWINGS_POS_CURRENT CNT_INSTALMENT_MATURE_CUM NAME_CONTRACT_STATUS SK_DPD SK_DPD_DEF
count 3.840312e+06 3.840312e+06 3.840312e+06 3.840312e+06 3.840312e+06 3.090496e+06 3.840312e+06 3.090496e+06 3.090496e+06 3.535076e+06 3.072324e+06 3.840312e+06 3.840312e+06 3.840312e+06 3.840312e+06 3.090496e+06 3.840312e+06 3.090496e+06 3.090496e+06 3.535076e+06 3840312 3.840312e+06 3.840312e+06
unique NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 7 NaN NaN
top NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Active NaN NaN
freq NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 3698436 NaN NaN
mean 1.904504e+06 2.783242e+05 -3.452192e+01 5.830016e+04 1.538080e+05 5.961325e+03 7.433388e+03 2.881696e+02 2.968805e+03 3.540204e+03 1.028054e+04 7.588857e+03 5.596588e+04 5.808881e+04 5.809829e+04 3.094490e-01 7.031439e-01 4.812496e-03 5.594791e-01 2.082508e+01 NaN 9.283667e+00 3.316220e-01
std 5.364695e+05 1.027045e+05 2.666775e+01 1.063070e+05 1.651457e+05 2.822569e+04 3.384608e+04 8.201989e+03 2.079689e+04 5.600154e+03 3.607808e+04 3.200599e+04 1.025336e+05 1.059654e+05 1.059718e+05 1.100401e+00 3.190347e+00 8.263861e-02 3.240649e+00 2.005149e+01 NaN 9.751570e+01 2.147923e+01
min 1.000018e+06 1.000060e+05 -9.600000e+01 -4.202502e+05 0.000000e+00 -6.827310e+03 -6.211620e+03 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 -4.233058e+05 -4.202502e+05 -4.202502e+05 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 NaN 0.000000e+00 0.000000e+00
25% 1.434385e+06 1.895170e+05 -5.500000e+01 0.000000e+00 4.500000e+04 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.523700e+02 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 4.000000e+00 NaN 0.000000e+00 0.000000e+00
50% 1.897122e+06 2.783960e+05 -2.800000e+01 0.000000e+00 1.125000e+05 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 2.702700e+03 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.500000e+01 NaN 0.000000e+00 0.000000e+00
75% 2.369328e+06 3.675800e+05 -1.100000e+01 8.904669e+04 1.800000e+05 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 6.633911e+03 9.000000e+03 6.750000e+03 8.535924e+04 8.889949e+04 8.891451e+04 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 3.200000e+01 NaN 0.000000e+00 0.000000e+00
max 2.843496e+06 4.562500e+05 -1.000000e+00 1.505902e+06 1.350000e+06 2.115000e+06 2.287098e+06 1.529847e+06 2.239274e+06 2.028820e+05 4.289207e+06 4.278316e+06 1.472317e+06 1.493338e+06 1.493338e+06 5.100000e+01 1.650000e+02 1.200000e+01 1.650000e+02 1.200000e+02 NaN 3.260000e+03 3.260000e+03

Percentage and count of missing values in each column of the credit_card_balance dataset¶

In [ ]:
percent = (datasets["credit_card_balance"].isnull().sum()/datasets["credit_card_balance"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["credit_card_balance"].isna().sum().sort_values(ascending = False)
total_count_with = datasets["credit_card_balance"].isnull().count()
total_count_without = datasets["credit_card_balance"].count()
missing_application_train_data  = pd.concat([percent, sum_missing, total_count_without, total_count_with], axis=1, keys=['Percent', 'Missing Count', 'Total Count without Missing value', 'Total Count'])
missing_application_train_data
Out[ ]:
Percent Missing Count Total Count without Missing value Total Count
AMT_PAYMENT_CURRENT 20.00 767988 3072324 3840312
AMT_DRAWINGS_ATM_CURRENT 19.52 749816 3090496 3840312
CNT_DRAWINGS_POS_CURRENT 19.52 749816 3090496 3840312
AMT_DRAWINGS_OTHER_CURRENT 19.52 749816 3090496 3840312
AMT_DRAWINGS_POS_CURRENT 19.52 749816 3090496 3840312
CNT_DRAWINGS_OTHER_CURRENT 19.52 749816 3090496 3840312
CNT_DRAWINGS_ATM_CURRENT 19.52 749816 3090496 3840312
CNT_INSTALMENT_MATURE_CUM 7.95 305236 3535076 3840312
AMT_INST_MIN_REGULARITY 7.95 305236 3535076 3840312
SK_ID_PREV 0.00 0 3840312 3840312
AMT_TOTAL_RECEIVABLE 0.00 0 3840312 3840312
SK_DPD 0.00 0 3840312 3840312
NAME_CONTRACT_STATUS 0.00 0 3840312 3840312
CNT_DRAWINGS_CURRENT 0.00 0 3840312 3840312
AMT_PAYMENT_TOTAL_CURRENT 0.00 0 3840312 3840312
AMT_RECIVABLE 0.00 0 3840312 3840312
AMT_RECEIVABLE_PRINCIPAL 0.00 0 3840312 3840312
SK_ID_CURR 0.00 0 3840312 3840312
AMT_DRAWINGS_CURRENT 0.00 0 3840312 3840312
AMT_CREDIT_LIMIT_ACTUAL 0.00 0 3840312 3840312
AMT_BALANCE 0.00 0 3840312 3840312
MONTHS_BALANCE 0.00 0 3840312 3840312
SK_DPD_DEF 0.00 0 3840312 3840312

Bar plot to visualize the percentage of missing data for each features of credit_card_balance dataset¶

In [ ]:
percent = (datasets['credit_card_balance'].isnull().sum()/datasets['credit_card_balance'].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets['credit_card_balance'].isna().sum().sort_values(ascending = False)
missing_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Missing Count"])
missing_data = missing_data[missing_data['Percent'] > 0]
plt.title(f'Missing data for credit_card_balance in percent', fontsize=10)
sns.barplot(x = missing_data["Percent"],y = missing_data.index ,alpha=0.8)
plt.xlabel('Missing values (%)', fontsize=10)
plt.ylabel('Features', fontsize=10)
plt.show()

Histogram of the distribution of the AMT_PAYMENT_TOTAL_CURRENT feature¶

In [ ]:
fig, ax = plt.subplots()
ax.hist(datasets['credit_card_balance']['AMT_PAYMENT_TOTAL_CURRENT'], bins=50)
ax.set_title('Distribution of AMT_PAYMENT_TOTAL_CURRENT')
plt.xlabel("Total Amount in the corresponding month")
plt.ylabel("Count")
ax.tick_params(labelsize=10,length=0)
plt.box(False)
plt.ticklabel_format(style='plain', axis='y')
ax.ticklabel_format(useOffset=False, style='plain')
ax.yaxis.grid(linewidth=0.5,color='black')
ax.set_axisbelow(True)

Observation:¶

  • We can observe that there is a high count of credit card payments with a total amount of 0, which could suggest that many credit card holders either did not make a payment or paid the minimum amount due for that particular month.

Distribution of AMT_CREDIT_LIMIT_ACTUAL feature¶

In [ ]:
fig, ax = plt.subplots()
sns.histplot(data=datasets['credit_card_balance'], x='AMT_CREDIT_LIMIT_ACTUAL', kde=True)
plt.title('Distribution of Credit Card Limits')
plt.xlabel('Credit Card Limit')
plt.ylabel('Count')
ax.ticklabel_format(useOffset=False, style='plain')
plt.show()

Observation:¶

  • We can observe that there is more area between 0 and 200000 may suggest that borrowers tend to have lower outstanding balances and make smaller payments, rather than carrying large balances and making large payments.

Scatter plot to show the relationship between the AMT_CREDIT_LIMIT_ACTUAL and AMT_BALANCE¶

In [ ]:
fig, ax = plt.subplots()
datasets['credit_card_balance']["credit_limit"] = datasets['credit_card_balance']["AMT_CREDIT_LIMIT_ACTUAL"] >= datasets['credit_card_balance']["AMT_BALANCE"]
datasets['credit_card_balance']["credit_limit"] = datasets['credit_card_balance']["credit_limit"].apply(lambda x: "Less spent" if x == True else "More spent")
sns.scatterplot(data=datasets['credit_card_balance'], x='AMT_CREDIT_LIMIT_ACTUAL', y='AMT_DRAWINGS_CURRENT', hue = "credit_limit")
plt.title('Credit Card Limit vs. Amount Spent During the Month')
plt.xlabel('Credit Card Limit')
plt.ylabel('Amount Spent During the Month')
ax.ticklabel_format(useOffset=False, style='plain')
plt.show()

Observation:¶

  • This indicates that in most cases, the credit limit of the cardholders is sufficient to cover the amount spent during the month.

Merging credit_card_balance and application_train based on SK_ID_CURR¶

In [ ]:
merged_df = datasets['application_train'][['SK_ID_CURR', 'TARGET']].merge(datasets['credit_card_balance'], on='SK_ID_CURR', how='left')
merged_df.head(5)
Out[ ]:
SK_ID_CURR TARGET SK_ID_PREV MONTHS_BALANCE AMT_BALANCE AMT_CREDIT_LIMIT_ACTUAL AMT_DRAWINGS_ATM_CURRENT AMT_DRAWINGS_CURRENT AMT_DRAWINGS_OTHER_CURRENT AMT_DRAWINGS_POS_CURRENT AMT_INST_MIN_REGULARITY AMT_PAYMENT_CURRENT AMT_PAYMENT_TOTAL_CURRENT AMT_RECEIVABLE_PRINCIPAL AMT_RECIVABLE AMT_TOTAL_RECEIVABLE CNT_DRAWINGS_ATM_CURRENT CNT_DRAWINGS_CURRENT CNT_DRAWINGS_OTHER_CURRENT CNT_DRAWINGS_POS_CURRENT CNT_INSTALMENT_MATURE_CUM NAME_CONTRACT_STATUS SK_DPD SK_DPD_DEF
0 100002 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 100003 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 100004 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 100006 0 1489396.0 -2.0 0.0 270000.0 NaN 0.0 NaN NaN 0.0 NaN 0.0 0.0 0.0 0.0 NaN 0.0 NaN NaN 0.0 Active 0.0 0.0
4 100006 0 1489396.0 -1.0 0.0 270000.0 NaN 0.0 NaN NaN 0.0 NaN 0.0 0.0 0.0 0.0 NaN 0.0 NaN NaN 0.0 Active 0.0 0.0

Correlation of TARGET feature with credit_card_balance dataset¶

In [ ]:
merged_df.corr()['TARGET'].sort_values(ascending= False)
Out[ ]:
TARGET                        1.000000
AMT_BALANCE                   0.050098
AMT_TOTAL_RECEIVABLE          0.049839
AMT_RECIVABLE                 0.049803
AMT_RECEIVABLE_PRINCIPAL      0.049692
AMT_INST_MIN_REGULARITY       0.039798
CNT_DRAWINGS_ATM_CURRENT      0.038437
CNT_DRAWINGS_CURRENT          0.037793
MONTHS_BALANCE                0.035695
CNT_DRAWINGS_POS_CURRENT      0.029536
AMT_DRAWINGS_ATM_CURRENT      0.024700
AMT_DRAWINGS_CURRENT          0.022378
AMT_CREDIT_LIMIT_ACTUAL       0.013823
AMT_PAYMENT_CURRENT           0.012929
AMT_PAYMENT_TOTAL_CURRENT     0.012302
SK_DPD_DEF                    0.010538
AMT_DRAWINGS_POS_CURRENT      0.005084
AMT_DRAWINGS_OTHER_CURRENT    0.003843
CNT_DRAWINGS_OTHER_CURRENT    0.003044
SK_ID_PREV                    0.002571
SK_DPD                        0.001684
SK_ID_CURR                   -0.004412
CNT_INSTALMENT_MATURE_CUM    -0.023684
Name: TARGET, dtype: float64

Correlation of credit_card_balance dataset¶

In [ ]:
fig = plt.figure(figsize=(20,15))
sns.heatmap(merged_df.corr(), annot=True)
Out[ ]:
<Axes: >

Observation:¶

  • The insight from the correlation matrix is that there are weak positive correlations between the TARGET column and the credit card features in the merged dataset.

  • The AMT_BALANCE, AMT_TOTAL_RECEIVABLE, AMT_RECIVABLE, and AMT_RECEIVABLE_PRINCIPAL columns have the highest correlation with the TARGET column, indicating that borrowers with higher credit card balances and receivables are more likely to default on their loans.

  • Other features such as AMT_INST_MIN_REGULARITY, CNT_DRAWINGS_ATM_CURRENT, and CNT_DRAWINGS_CURRENT also have a weak correlation with the TARGET column. Based on these correlations, it may be important to consider credit card balance and receivables as key features in predicting loan defaults. This suggests that borrowers who make smaller minimum payments on their credit cards, as well as those who make more ATM withdrawals and overall credit card drawings, may be more likely to have difficulty repaying their loans.

Summary of Previous Application¶

In [ ]:
edaPAdf = pd.DataFrame()
In [ ]:
edaPAdf['SK_ID_CURR'] = datasets['application_train']['SK_ID_CURR']
edaPAdf['TARGET'] = datasets['application_train']['TARGET']

Shape of previous_application¶

In [ ]:
datasets['previous_application'].shape
Out[ ]:
(1670214, 37)

Information about previous_application¶

In [ ]:
datasets['previous_application'].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1670214 entries, 0 to 1670213
Data columns (total 37 columns):
 #   Column                       Non-Null Count    Dtype  
---  ------                       --------------    -----  
 0   SK_ID_PREV                   1670214 non-null  int64  
 1   SK_ID_CURR                   1670214 non-null  int64  
 2   NAME_CONTRACT_TYPE           1670214 non-null  object 
 3   AMT_ANNUITY                  1297979 non-null  float64
 4   AMT_APPLICATION              1670214 non-null  float64
 5   AMT_CREDIT                   1670213 non-null  float64
 6   AMT_DOWN_PAYMENT             774370 non-null   float64
 7   AMT_GOODS_PRICE              1284699 non-null  float64
 8   WEEKDAY_APPR_PROCESS_START   1670214 non-null  object 
 9   HOUR_APPR_PROCESS_START      1670214 non-null  int64  
 10  FLAG_LAST_APPL_PER_CONTRACT  1670214 non-null  object 
 11  NFLAG_LAST_APPL_IN_DAY       1670214 non-null  int64  
 12  RATE_DOWN_PAYMENT            774370 non-null   float64
 13  RATE_INTEREST_PRIMARY        5951 non-null     float64
 14  RATE_INTEREST_PRIVILEGED     5951 non-null     float64
 15  NAME_CASH_LOAN_PURPOSE       1670214 non-null  object 
 16  NAME_CONTRACT_STATUS         1670214 non-null  object 
 17  DAYS_DECISION                1670214 non-null  int64  
 18  NAME_PAYMENT_TYPE            1670214 non-null  object 
 19  CODE_REJECT_REASON           1670214 non-null  object 
 20  NAME_TYPE_SUITE              849809 non-null   object 
 21  NAME_CLIENT_TYPE             1670214 non-null  object 
 22  NAME_GOODS_CATEGORY          1670214 non-null  object 
 23  NAME_PORTFOLIO               1670214 non-null  object 
 24  NAME_PRODUCT_TYPE            1670214 non-null  object 
 25  CHANNEL_TYPE                 1670214 non-null  object 
 26  SELLERPLACE_AREA             1670214 non-null  int64  
 27  NAME_SELLER_INDUSTRY         1670214 non-null  object 
 28  CNT_PAYMENT                  1297984 non-null  float64
 29  NAME_YIELD_GROUP             1670214 non-null  object 
 30  PRODUCT_COMBINATION          1669868 non-null  object 
 31  DAYS_FIRST_DRAWING           997149 non-null   float64
 32  DAYS_FIRST_DUE               997149 non-null   float64
 33  DAYS_LAST_DUE_1ST_VERSION    997149 non-null   float64
 34  DAYS_LAST_DUE                997149 non-null   float64
 35  DAYS_TERMINATION             997149 non-null   float64
 36  NFLAG_INSURED_ON_APPROVAL    997149 non-null   float64
dtypes: float64(15), int64(6), object(16)
memory usage: 471.5+ MB

Statistical summary of previous_application¶

In [ ]:
datasets['previous_application'].describe()
Out[ ]:
SK_ID_PREV SK_ID_CURR AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE HOUR_APPR_PROCESS_START NFLAG_LAST_APPL_IN_DAY RATE_DOWN_PAYMENT RATE_INTEREST_PRIMARY RATE_INTEREST_PRIVILEGED DAYS_DECISION SELLERPLACE_AREA CNT_PAYMENT DAYS_FIRST_DRAWING DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION NFLAG_INSURED_ON_APPROVAL
count 1.670214e+06 1.670214e+06 1.297979e+06 1.670214e+06 1.670213e+06 7.743700e+05 1.284699e+06 1.670214e+06 1.670214e+06 774370.000000 5951.000000 5951.000000 1.670214e+06 1.670214e+06 1.297984e+06 997149.000000 997149.000000 997149.000000 997149.000000 997149.000000 997149.000000
mean 1.923089e+06 2.783572e+05 1.595512e+04 1.752339e+05 1.961140e+05 6.697402e+03 2.278473e+05 1.248418e+01 9.964675e-01 0.079637 0.188357 0.773503 -8.806797e+02 3.139511e+02 1.605408e+01 342209.855039 13826.269337 33767.774054 76582.403064 81992.343838 0.332570
std 5.325980e+05 1.028148e+05 1.478214e+04 2.927798e+05 3.185746e+05 2.092150e+04 3.153966e+05 3.334028e+00 5.932963e-02 0.107823 0.087671 0.100879 7.790997e+02 7.127443e+03 1.456729e+01 88916.115833 72444.869708 106857.034789 149647.415123 153303.516729 0.471134
min 1.000001e+06 1.000010e+05 0.000000e+00 0.000000e+00 0.000000e+00 -9.000000e-01 0.000000e+00 0.000000e+00 0.000000e+00 -0.000015 0.034781 0.373150 -2.922000e+03 -1.000000e+00 0.000000e+00 -2922.000000 -2892.000000 -2801.000000 -2889.000000 -2874.000000 0.000000
25% 1.461857e+06 1.893290e+05 6.321780e+03 1.872000e+04 2.416050e+04 0.000000e+00 5.084100e+04 1.000000e+01 1.000000e+00 0.000000 0.160716 0.715645 -1.300000e+03 -1.000000e+00 6.000000e+00 365243.000000 -1628.000000 -1242.000000 -1314.000000 -1270.000000 0.000000
50% 1.923110e+06 2.787145e+05 1.125000e+04 7.104600e+04 8.054100e+04 1.638000e+03 1.123200e+05 1.200000e+01 1.000000e+00 0.051605 0.189122 0.835095 -5.810000e+02 3.000000e+00 1.200000e+01 365243.000000 -831.000000 -361.000000 -537.000000 -499.000000 0.000000
75% 2.384280e+06 3.675140e+05 2.065842e+04 1.803600e+05 2.164185e+05 7.740000e+03 2.340000e+05 1.500000e+01 1.000000e+00 0.108909 0.193330 0.852537 -2.800000e+02 8.200000e+01 2.400000e+01 365243.000000 -411.000000 129.000000 -74.000000 -44.000000 1.000000
max 2.845382e+06 4.562550e+05 4.180581e+05 6.905160e+06 6.905160e+06 3.060045e+06 6.905160e+06 2.300000e+01 1.000000e+00 1.000000 1.000000 1.000000 -1.000000e+00 4.000000e+06 8.400000e+01 365243.000000 365243.000000 365243.000000 365243.000000 365243.000000 1.000000
In [ ]:
datasets['previous_application'].describe(include='all')
Out[ ]:
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START FLAG_LAST_APPL_PER_CONTRACT NFLAG_LAST_APPL_IN_DAY RATE_DOWN_PAYMENT RATE_INTEREST_PRIMARY RATE_INTEREST_PRIVILEGED NAME_CASH_LOAN_PURPOSE NAME_CONTRACT_STATUS DAYS_DECISION NAME_PAYMENT_TYPE CODE_REJECT_REASON NAME_TYPE_SUITE NAME_CLIENT_TYPE NAME_GOODS_CATEGORY NAME_PORTFOLIO NAME_PRODUCT_TYPE CHANNEL_TYPE SELLERPLACE_AREA NAME_SELLER_INDUSTRY CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION DAYS_FIRST_DRAWING DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION NFLAG_INSURED_ON_APPROVAL
count 1.670214e+06 1.670214e+06 1670214 1.297979e+06 1.670214e+06 1.670213e+06 7.743700e+05 1.284699e+06 1670214 1.670214e+06 1670214 1.670214e+06 774370.000000 5951.000000 5951.000000 1670214 1670214 1.670214e+06 1670214 1670214 849809 1670214 1670214 1670214 1670214 1670214 1.670214e+06 1670214 1.297984e+06 1670214 1669868 997149.000000 997149.000000 997149.000000 997149.000000 997149.000000 997149.000000
unique NaN NaN 4 NaN NaN NaN NaN NaN 7 NaN 2 NaN NaN NaN NaN 25 4 NaN 4 9 7 4 28 5 3 8 NaN 11 NaN 5 17 NaN NaN NaN NaN NaN NaN
top NaN NaN Cash loans NaN NaN NaN NaN NaN TUESDAY NaN Y NaN NaN NaN NaN XAP Approved NaN Cash through the bank XAP Unaccompanied Repeater XNA POS XNA Credit and cash offices NaN XNA NaN XNA Cash NaN NaN NaN NaN NaN NaN
freq NaN NaN 747553 NaN NaN NaN NaN NaN 255118 NaN 1661739 NaN NaN NaN NaN 922661 1036781 NaN 1033552 1353093 508970 1231261 950809 691011 1063666 719968 NaN 855720 NaN 517215 285990 NaN NaN NaN NaN NaN NaN
mean 1.923089e+06 2.783572e+05 NaN 1.595512e+04 1.752339e+05 1.961140e+05 6.697402e+03 2.278473e+05 NaN 1.248418e+01 NaN 9.964675e-01 0.079637 0.188357 0.773503 NaN NaN -8.806797e+02 NaN NaN NaN NaN NaN NaN NaN NaN 3.139511e+02 NaN 1.605408e+01 NaN NaN 342209.855039 13826.269337 33767.774054 76582.403064 81992.343838 0.332570
std 5.325980e+05 1.028148e+05 NaN 1.478214e+04 2.927798e+05 3.185746e+05 2.092150e+04 3.153966e+05 NaN 3.334028e+00 NaN 5.932963e-02 0.107823 0.087671 0.100879 NaN NaN 7.790997e+02 NaN NaN NaN NaN NaN NaN NaN NaN 7.127443e+03 NaN 1.456729e+01 NaN NaN 88916.115833 72444.869708 106857.034789 149647.415123 153303.516729 0.471134
min 1.000001e+06 1.000010e+05 NaN 0.000000e+00 0.000000e+00 0.000000e+00 -9.000000e-01 0.000000e+00 NaN 0.000000e+00 NaN 0.000000e+00 -0.000015 0.034781 0.373150 NaN NaN -2.922000e+03 NaN NaN NaN NaN NaN NaN NaN NaN -1.000000e+00 NaN 0.000000e+00 NaN NaN -2922.000000 -2892.000000 -2801.000000 -2889.000000 -2874.000000 0.000000
25% 1.461857e+06 1.893290e+05 NaN 6.321780e+03 1.872000e+04 2.416050e+04 0.000000e+00 5.084100e+04 NaN 1.000000e+01 NaN 1.000000e+00 0.000000 0.160716 0.715645 NaN NaN -1.300000e+03 NaN NaN NaN NaN NaN NaN NaN NaN -1.000000e+00 NaN 6.000000e+00 NaN NaN 365243.000000 -1628.000000 -1242.000000 -1314.000000 -1270.000000 0.000000
50% 1.923110e+06 2.787145e+05 NaN 1.125000e+04 7.104600e+04 8.054100e+04 1.638000e+03 1.123200e+05 NaN 1.200000e+01 NaN 1.000000e+00 0.051605 0.189122 0.835095 NaN NaN -5.810000e+02 NaN NaN NaN NaN NaN NaN NaN NaN 3.000000e+00 NaN 1.200000e+01 NaN NaN 365243.000000 -831.000000 -361.000000 -537.000000 -499.000000 0.000000
75% 2.384280e+06 3.675140e+05 NaN 2.065842e+04 1.803600e+05 2.164185e+05 7.740000e+03 2.340000e+05 NaN 1.500000e+01 NaN 1.000000e+00 0.108909 0.193330 0.852537 NaN NaN -2.800000e+02 NaN NaN NaN NaN NaN NaN NaN NaN 8.200000e+01 NaN 2.400000e+01 NaN NaN 365243.000000 -411.000000 129.000000 -74.000000 -44.000000 1.000000
max 2.845382e+06 4.562550e+05 NaN 4.180581e+05 6.905160e+06 6.905160e+06 3.060045e+06 6.905160e+06 NaN 2.300000e+01 NaN 1.000000e+00 1.000000 1.000000 1.000000 NaN NaN -1.000000e+00 NaN NaN NaN NaN NaN NaN NaN NaN 4.000000e+06 NaN 8.400000e+01 NaN NaN 365243.000000 365243.000000 365243.000000 365243.000000 365243.000000 1.000000

Missing Values for Previous Application¶

In [ ]:
edaPAdf = edaPAdf.merge(datasets["previous_application"], how='left', on='SK_ID_CURR')
In [ ]:
percent = (edaPAdf.isnull().sum()/edaPAdf.isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = edaPAdf.isna().sum().sort_values(ascending = False)
missing_previous_application_data  = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Missing Count"])
missing_previous_application_data.head(20)
Out[ ]:
Percent Missing Count
RATE_INTEREST_PRIMARY 99.67 1425364
RATE_INTEREST_PRIVILEGED 99.67 1425364
AMT_DOWN_PAYMENT 53.56 765994
RATE_DOWN_PAYMENT 53.56 765994
NAME_TYPE_SUITE 49.72 711126
NFLAG_INSURED_ON_APPROVAL 40.38 577560
DAYS_TERMINATION 40.38 577560
DAYS_LAST_DUE 40.38 577560
DAYS_LAST_DUE_1ST_VERSION 40.38 577560
DAYS_FIRST_DUE 40.38 577560
DAYS_FIRST_DRAWING 40.38 577560
AMT_GOODS_PRICE 23.49 335979
AMT_ANNUITY 22.63 323672
CNT_PAYMENT 22.63 323667
PRODUCT_COMBINATION 1.17 16767
AMT_CREDIT 1.15 16455
NAME_YIELD_GROUP 1.15 16454
NAME_GOODS_CATEGORY 1.15 16454
NAME_SELLER_INDUSTRY 1.15 16454
SELLERPLACE_AREA 1.15 16454
In [ ]:
# Filter rows with count of missing values greater than 0
missing_previous_application_data = missing_previous_application_data[missing_previous_application_data['Missing Count'] > 0]
print("Number of columns with missing value greater than 0:",len(missing_previous_application_data))
missing_more_than_60=missing_previous_application_data[missing_previous_application_data['Percent'] >=60]
print("Number of columns with more than 60% missing values:", len(missing_more_than_60))

plt.figure(figsize=(15, 20))

sns.set_style("whitegrid")
ax = sns.barplot(x="Missing Count", y=missing_previous_application_data.index, data=missing_previous_application_data,
                 palette="rocket")

# Set the figure size and axis labels

ax.set_xlabel("Number of Missing Values ")
ax.set_ylabel("Features")

plt.title('Missing values in previous application columns')

# Add values to the bars
for i, v in enumerate(missing_previous_application_data['Missing Count']):
    ax.text(v + 10, i + .25, str(v), color='black', fontweight='bold')

# Show the plot
plt.show()
Number of columns with missing value greater than 0: 36
Number of columns with more than 60% missing values: 2

Observation¶

4 columns have more than 50% of the data missing, we can see how to deal with these columns as two of these 4 columns have 99% data missing and it would make no sense to keep these columns but we will check it out in the next phase about what could be done and if there are any alternatives that we can consider.

Correlation with the Target column¶

In [ ]:
edaPAdf.corr()['TARGET']
Out[ ]:
SK_ID_CURR                  -0.001216
TARGET                       1.000000
SK_ID_PREV                   0.002009
AMT_ANNUITY                 -0.014922
AMT_APPLICATION             -0.005583
AMT_CREDIT                  -0.002350
AMT_DOWN_PAYMENT            -0.016918
AMT_GOODS_PRICE              0.000254
HOUR_APPR_PROCESS_START     -0.027809
NFLAG_LAST_APPL_IN_DAY      -0.002887
RATE_DOWN_PAYMENT           -0.026111
RATE_INTEREST_PRIMARY       -0.001470
RATE_INTEREST_PRIVILEGED     0.028640
DAYS_DECISION                0.039901
SELLERPLACE_AREA            -0.002539
CNT_PAYMENT                  0.030480
DAYS_FIRST_DRAWING          -0.031154
DAYS_FIRST_DUE              -0.006651
DAYS_LAST_DUE_1ST_VERSION    0.018021
DAYS_LAST_DUE                0.017522
DAYS_TERMINATION             0.016981
NFLAG_INSURED_ON_APPROVAL    0.000653
Name: TARGET, dtype: float64
In [ ]:
correlations = edaPAdf.corr()
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(correlations, cmap = cmap)
Out[ ]:
<Axes: >

Contact type of the previous applications¶

In [ ]:
fig, ax = plt.subplots(figsize=(6, 6))

contract_type_counts = edaPAdf['NAME_CONTRACT_TYPE'].value_counts()
contract_type_labels = contract_type_counts.index
contract_type_sizes = contract_type_counts.values
colors = ["#73A4CA", "#497AA7", "#2E5B88"]
ax.pie(contract_type_sizes, labels=contract_type_labels, autopct='%1.1f%%', startangle=90, textprops={'fontsize': 12}, colors=colors)
# Add a common title for the figure
ax.set_title('Contract type of the previous applications', fontsize=16)

# Set the aspect ratio to "equal" and the x and y limits to the same value to make it a perfect circle
ax.set_aspect('equal')
ax.set_xlim([-1.1, 1.1])
ax.set_ylim([-1.1, 1.1])

# Show the plot
plt.show()

Observation¶

  • It can be observed that in the current applications only two types of loans were being opted for which are cash and revolving loans but when we check the records for previous applications we can see that there is another type of loan called consumer loans that probably are not opted for anymore.
  • Also even during previous loans if we compare the percentage of cash to revolving loans cash loans are almost 4 times more opted for than the revolving loans.

Contract Status of the previous application¶

In [ ]:
sns.countplot(x='NAME_CONTRACT_STATUS', data=edaPAdf, palette = "Set3");
plt.title('Contract status of the previous applicant');
plt.xlabel('Contract Status')
plt.xticks(rotation=90);

Observation¶

From the graph we can deduce that atleast 60% of the applications got approved.

Was the client old or new¶

In [ ]:
fig, axs = plt.subplots(ncols=1, figsize=(12, 6))
sns.countplot(x='NAME_CLIENT_TYPE', hue='TARGET', data=edaPAdf, palette = 'rocket')
axs.set_title('Client old or new vs Target')
plt.xlabel('Client Type')
axs.set_ylabel('Count')
plt.show()

Observation¶

We can observe that we dont have many new clients but have a lot of repeater clients and most of these repeater clients got their loan approved.

Final credit amount on the previous application¶

In [ ]:
plt.figure(figsize=(12, 6)) 
sns.distplot(x=edaPAdf['AMT_CREDIT'], color='maroon') 
plt.xlabel('Credit')
plt.title('Distribution plot for Credit') 
plt.show()

Summary of POS Cash Balance¶

In [ ]:
edaPCBdf = pd.DataFrame()
In [ ]:
edaPCBdf['SK_ID_CURR'] = datasets['application_train']['SK_ID_CURR']
edaPCBdf['TARGET'] = datasets['application_train']['TARGET']

Shape of POS_CASH_balance¶

In [ ]:
datasets['POS_CASH_balance'].shape
Out[ ]:
(10001358, 8)

Information about POS_CASH_balance¶

In [ ]:
datasets['POS_CASH_balance'].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001358 entries, 0 to 10001357
Data columns (total 8 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   SK_ID_PREV             int64  
 1   SK_ID_CURR             int64  
 2   MONTHS_BALANCE         int64  
 3   CNT_INSTALMENT         float64
 4   CNT_INSTALMENT_FUTURE  float64
 5   NAME_CONTRACT_STATUS   object 
 6   SK_DPD                 int64  
 7   SK_DPD_DEF             int64  
dtypes: float64(2), int64(5), object(1)
memory usage: 610.4+ MB

Statistical summary of POS_CASH_balance¶

In [ ]:
datasets['POS_CASH_balance'].describe()
Out[ ]:
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE CNT_INSTALMENT CNT_INSTALMENT_FUTURE SK_DPD SK_DPD_DEF
count 1.000136e+07 1.000136e+07 1.000136e+07 9.975287e+06 9.975271e+06 1.000136e+07 1.000136e+07
mean 1.903217e+06 2.784039e+05 -3.501259e+01 1.708965e+01 1.048384e+01 1.160693e+01 6.544684e-01
std 5.358465e+05 1.027637e+05 2.606657e+01 1.199506e+01 1.110906e+01 1.327140e+02 3.276249e+01
min 1.000001e+06 1.000010e+05 -9.600000e+01 1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 1.434405e+06 1.895500e+05 -5.400000e+01 1.000000e+01 3.000000e+00 0.000000e+00 0.000000e+00
50% 1.896565e+06 2.786540e+05 -2.800000e+01 1.200000e+01 7.000000e+00 0.000000e+00 0.000000e+00
75% 2.368963e+06 3.674290e+05 -1.300000e+01 2.400000e+01 1.400000e+01 0.000000e+00 0.000000e+00
max 2.843499e+06 4.562550e+05 -1.000000e+00 9.200000e+01 8.500000e+01 4.231000e+03 3.595000e+03
In [ ]:
datasets['POS_CASH_balance'].describe(include='all')
Out[ ]:
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE CNT_INSTALMENT CNT_INSTALMENT_FUTURE NAME_CONTRACT_STATUS SK_DPD SK_DPD_DEF
count 1.000136e+07 1.000136e+07 1.000136e+07 9.975287e+06 9.975271e+06 10001358 1.000136e+07 1.000136e+07
unique NaN NaN NaN NaN NaN 9 NaN NaN
top NaN NaN NaN NaN NaN Active NaN NaN
freq NaN NaN NaN NaN NaN 9151119 NaN NaN
mean 1.903217e+06 2.784039e+05 -3.501259e+01 1.708965e+01 1.048384e+01 NaN 1.160693e+01 6.544684e-01
std 5.358465e+05 1.027637e+05 2.606657e+01 1.199506e+01 1.110906e+01 NaN 1.327140e+02 3.276249e+01
min 1.000001e+06 1.000010e+05 -9.600000e+01 1.000000e+00 0.000000e+00 NaN 0.000000e+00 0.000000e+00
25% 1.434405e+06 1.895500e+05 -5.400000e+01 1.000000e+01 3.000000e+00 NaN 0.000000e+00 0.000000e+00
50% 1.896565e+06 2.786540e+05 -2.800000e+01 1.200000e+01 7.000000e+00 NaN 0.000000e+00 0.000000e+00
75% 2.368963e+06 3.674290e+05 -1.300000e+01 2.400000e+01 1.400000e+01 NaN 0.000000e+00 0.000000e+00
max 2.843499e+06 4.562550e+05 -1.000000e+00 9.200000e+01 8.500000e+01 NaN 4.231000e+03 3.595000e+03

Missing Values for POS CASH Balance¶

In [ ]:
edaPCBdf = edaPCBdf.merge(datasets["POS_CASH_balance"], how='left', on='SK_ID_CURR')
In [ ]:
percent = (edaPCBdf.isnull().sum()/edaPCBdf.isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = edaPCBdf.isna().sum().sort_values(ascending = False)
missing_POS_CASH_balance_data  = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Missing Count"])
missing_POS_CASH_balance_data.head(20)
Out[ ]:
Percent Missing Count
CNT_INSTALMENT_FUTURE 0.47 39945
CNT_INSTALMENT 0.47 39930
SK_ID_PREV 0.21 18067
MONTHS_BALANCE 0.21 18067
NAME_CONTRACT_STATUS 0.21 18067
SK_DPD 0.21 18067
SK_DPD_DEF 0.21 18067
SK_ID_CURR 0.00 0
TARGET 0.00 0
In [ ]:
# Filter rows with count of missing values greater than 0
missing_POS_CASH_balance_data = missing_POS_CASH_balance_data[missing_POS_CASH_balance_data['Missing Count'] > 0]
print("Number of columns with missing value greater than 0:",len(missing_POS_CASH_balance_data))
missing_more_than_60=missing_POS_CASH_balance_data[missing_POS_CASH_balance_data['Percent'] >=60]
print("Number of columns with more than 60% missing values:", len(missing_more_than_60))

plt.figure(figsize=(10, 6))

sns.set_style("whitegrid")
ax = sns.barplot(x="Missing Count", y=missing_POS_CASH_balance_data.index, data=missing_POS_CASH_balance_data,
                 palette="rocket")

# Set the figure size and axis labels

ax.set_xlabel("Number of Missing Values ")
ax.set_ylabel("Features")

plt.title('Missing values in POS CASH Balance columns')

# Add values to the bars
for i, v in enumerate(missing_POS_CASH_balance_data['Missing Count']):
    ax.text(v + 10, i + .25, str(v), color='black', fontweight='bold')

# Show the plot
plt.show()
Number of columns with missing value greater than 0: 7
Number of columns with more than 60% missing values: 0

Observation¶

There are barely any missing values in this dataset.

Correlation with the Target column¶

In [ ]:
edaPCBdf.corr()['TARGET']
Out[ ]:
SK_ID_CURR              -0.002244
TARGET                   1.000000
SK_ID_PREV              -0.000056
MONTHS_BALANCE           0.020147
CNT_INSTALMENT           0.018506
CNT_INSTALMENT_FUTURE    0.021972
SK_DPD                   0.009866
SK_DPD_DEF               0.008594
Name: TARGET, dtype: float64
In [ ]:
correlations = edaPCBdf.corr()
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(correlations, cmap = cmap)
Out[ ]:
<Axes: >

Installments left to pay on previous credit vs Target¶

In [ ]:
fig, axs = plt.subplots(ncols=1, figsize=(10, 25))
sns.countplot(y='CNT_INSTALMENT_FUTURE', hue='TARGET', data=edaPCBdf, palette = 'Set1')
axs.set_title('Installments left to pay vs Target')
axs.set_xlabel('Count')
axs.set_ylabel('Installments left to pay')
plt.show()

Observation¶

Most of the clients have less than 10 installments left to pay and these people have the highest rate of repayment of their loans.

Contract status during the month¶

In [ ]:
fig, axs = plt.subplots(ncols=1, figsize=(15, 7))
sns.countplot(x='NAME_CONTRACT_STATUS', data=edaPCBdf, palette = 'crest')
axs.set_title('Contract status during the month')
axs.set_ylabel('Count')
plt.show()

Observation¶

Almost 90% of the contracts are active and around 9% are completed and 1% are just signed. (percentages are just written based on the visual pov)

Summary of Installments Payments¶

Information about the installments_payments dataset¶

In [ ]:
datasets["installments_payments"].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13605401 entries, 0 to 13605400
Data columns (total 8 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   SK_ID_PREV              int64  
 1   SK_ID_CURR              int64  
 2   NUM_INSTALMENT_VERSION  float64
 3   NUM_INSTALMENT_NUMBER   int64  
 4   DAYS_INSTALMENT         float64
 5   DAYS_ENTRY_PAYMENT      float64
 6   AMT_INSTALMENT          float64
 7   AMT_PAYMENT             float64
dtypes: float64(5), int64(3)
memory usage: 830.4 MB

Statistical summary of the installments_payments dataset¶

In [ ]:
datasets["installments_payments"].describe(include='all')
Out[ ]:
SK_ID_PREV SK_ID_CURR NUM_INSTALMENT_VERSION NUM_INSTALMENT_NUMBER DAYS_INSTALMENT DAYS_ENTRY_PAYMENT AMT_INSTALMENT AMT_PAYMENT
count 1.360540e+07 1.360540e+07 1.360540e+07 1.360540e+07 1.360540e+07 1.360250e+07 1.360540e+07 1.360250e+07
mean 1.903365e+06 2.784449e+05 8.566373e-01 1.887090e+01 -1.042270e+03 -1.051114e+03 1.705091e+04 1.723822e+04
std 5.362029e+05 1.027183e+05 1.035216e+00 2.666407e+01 8.009463e+02 8.005859e+02 5.057025e+04 5.473578e+04
min 1.000001e+06 1.000010e+05 0.000000e+00 1.000000e+00 -2.922000e+03 -4.921000e+03 0.000000e+00 0.000000e+00
25% 1.434191e+06 1.896390e+05 0.000000e+00 4.000000e+00 -1.654000e+03 -1.662000e+03 4.226085e+03 3.398265e+03
50% 1.896520e+06 2.786850e+05 1.000000e+00 8.000000e+00 -8.180000e+02 -8.270000e+02 8.884080e+03 8.125515e+03
75% 2.369094e+06 3.675300e+05 1.000000e+00 1.900000e+01 -3.610000e+02 -3.700000e+02 1.671021e+04 1.610842e+04
max 2.843499e+06 4.562550e+05 1.780000e+02 2.770000e+02 -1.000000e+00 -1.000000e+00 3.771488e+06 3.771488e+06

Percentage and count of missing values in each column of the installments_payments dataset¶

In [ ]:
percent = (datasets["installments_payments"].isnull().sum()/datasets["installments_payments"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["installments_payments"].isna().sum().sort_values(ascending = False)
total_count_with = datasets["installments_payments"].isnull().count()
total_count_without = datasets["installments_payments"].count()
missing_application_train_data  = pd.concat([percent, sum_missing, total_count_without, total_count_with], axis=1, keys=['Percent', 'Missing Count', 'Total Count without Missing value', 'Total Count'])
missing_application_train_data
Out[ ]:
Percent Missing Count Total Count without Missing value Total Count
DAYS_ENTRY_PAYMENT 0.02 2905 13602496 13605401
AMT_PAYMENT 0.02 2905 13602496 13605401
SK_ID_PREV 0.00 0 13605401 13605401
SK_ID_CURR 0.00 0 13605401 13605401
NUM_INSTALMENT_VERSION 0.00 0 13605401 13605401
NUM_INSTALMENT_NUMBER 0.00 0 13605401 13605401
DAYS_INSTALMENT 0.00 0 13605401 13605401
AMT_INSTALMENT 0.00 0 13605401 13605401

Bar plot to visualize the percentage of missing data for each features of installments_payments dataset¶

In [ ]:
percent = (datasets['installments_payments'].isnull().sum()/datasets['installments_payments'].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets['installments_payments'].isna().sum().sort_values(ascending = False)
missing_data  = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Missing Count"])
missing_data=missing_data[missing_data['Percent'] > 0]
fig, ax = plt.subplots()
plt.title(f'Missing data for installments_payments in percent', fontsize=10)
sns.barplot(x = missing_data["Percent"],y = missing_data.index ,alpha=0.8)
plt.xlabel('Missing values (%)', fontsize=10)
plt.ylabel('Features', fontsize=10)
Out[ ]:
Text(0, 0.5, 'Features')

Pie chart to show the relationship between the AMT_PAYMENT and AMT_INSTALMENT¶

In [ ]:
inst_pay = pd.DataFrame()
inst_pay['AMT_DIFF'] = datasets["installments_payments"]['AMT_INSTALMENT'] - datasets["installments_payments"]['AMT_PAYMENT']
inst_pay['OVER_PAYMENT'] = inst_pay['AMT_DIFF'].apply(lambda x: 1 if x < 0 else 0)
inst_pay['PARTIAL_PAYMENT'] = inst_pay['AMT_DIFF'].apply(lambda x: 1 if x > 0 else 0)
inst_pay['FULL_PAYMENT'] = inst_pay['AMT_DIFF'].apply(lambda x: 1 if x == 0 else 0)

full_payment_pct = inst_pay['FULL_PAYMENT'].sum() / inst_pay.shape[0] * 100
partial_payment_pct = inst_pay['PARTIAL_PAYMENT'].sum() / inst_pay.shape[0] * 100
over_payment_pct = inst_pay['OVER_PAYMENT'].sum() / inst_pay.shape[0] * 100

labels = ['Full Payment', 'Partial Payment', 'Over Payment']
x = [full_payment_pct, partial_payment_pct, over_payment_pct]
colors = ['#FF671F','#046A38','#06038D']
explode = (0, 0.05, -0.02)
wedge_properties = {"edgecolor":"k",'width': .25}

plt.pie(x, labels=labels, explode=explode, colors=colors,
        startangle=150, counterclock=False, shadow=True,
        wedgeprops=wedge_properties, autopct="%1.1f%%")
plt.title("Distribution of Installment Payment Scenarios")
plt.show()

Observation:¶

  • This suggests that most borrowers are able to make their payments in full, while only a small percentage of payments are overpaid or underpaid.

Bar plot to show the total number of counts between the AMT_PAYMENT and AMT_INSTALMENT¶

In [ ]:
payment_counts = [inst_pay['FULL_PAYMENT'].sum(), inst_pay['PARTIAL_PAYMENT'].sum(), inst_pay['OVER_PAYMENT'].sum()]
payment_labels = ['Full Payment', 'Partial Payment', 'Over Payment']
columns = {
 'Payment Scenarios': payment_labels,
 'Count': payment_counts
}
df = pd.DataFrame(columns)

fig, ax = plt.subplots()
sns.barplot(x = 'Payment Scenarios', y = 'Count', data = df, ci=95, ax=ax, palette = 'plasma')
ax.set_title('Installment Payment Scenarios')
ax.tick_params(labelsize=10,length=0)
plt.box(False)
plt.ticklabel_format(style='plain', axis='y')
ax.yaxis.grid(linewidth=0.5,color='black')
ax.set_axisbelow(True)

for index, row in df.iterrows():
    ax.text(row.name, row.Count, round(row.Count), color='black', ha='center')

Observation:¶

  • The majority of borrowers made full payments, indicating a strong repayment capacity, whereas a very small percentage of payments were overpaid or underpaid, reflecting that the borrowers were able to manage their repayments efficiently.

Distribution of NUM_INSTALMENT_VERSION feature¶

In [ ]:
installment_versions = datasets["installments_payments"]["NUM_INSTALMENT_VERSION"].value_counts()
installment_versions = installment_versions[installment_versions > 1000]
columns = {
 'Installment Versions': installment_versions.index.astype('int64'),
 'Count': installment_versions.values
}
df = pd.DataFrame(columns)
fig, ax = plt.subplots()
sns.barplot(x = 'Installment Versions', y = 'Count', data = df)
ax.set_title('Count of Different Installment Versions')
plt.xticks(rotation=90)
ax.tick_params(labelsize=10,length=0)
plt.box(False)
plt.ticklabel_format(style='plain', axis='y')
ax.yaxis.grid(linewidth=0.5,color='black')
ax.set_axisbelow(True)

Observation:¶

  • The installment version 1 has the highest count of more than 8 million, followed by version 0 with a count of around 4 million. This indicates that the majority of the installment payments have been made with version 0 and 1, while other installment versions have relatively low counts.

Distribution of NUM_INSTALMENT_NUMBER feature¶

In [ ]:
total_payments = datasets["installments_payments"]["AMT_PAYMENT"].sum()
payments_by_installment = datasets["installments_payments"].groupby("NUM_INSTALMENT_NUMBER")["AMT_PAYMENT"].sum()
percentages = payments_by_installment / total_payments * 100
percentages = percentages[percentages > 2.0]
plt.pie(percentages.values, labels=percentages.index, autopct='%1.1f%%')
plt.title("Percentage of Total Payments Made by Installment Number")
plt.show()

Observation:¶

  • The insight from the observation is that most borrowers were able to make their payments in a timely manner within the first few installments, which could indicate good financial management on their part.
  • It could also suggest that the loan terms and repayment schedules were structured in a way that made it feasible for borrowers to make timely payments.

Merging installments_payments and application_train based on SK_ID_CURR¶

In [ ]:
merged_df = datasets['application_train'][['SK_ID_CURR', 'TARGET']].merge(datasets['installments_payments'], on='SK_ID_CURR', how='left')
merged_df.head(5)
Out[ ]:
SK_ID_CURR TARGET SK_ID_PREV NUM_INSTALMENT_VERSION NUM_INSTALMENT_NUMBER DAYS_INSTALMENT DAYS_ENTRY_PAYMENT AMT_INSTALMENT AMT_PAYMENT
0 100002 1 1038818.0 1.0 16.0 -115.0 -133.0 9251.775 9251.775
1 100002 1 1038818.0 1.0 8.0 -355.0 -375.0 9251.775 9251.775
2 100002 1 1038818.0 2.0 19.0 -25.0 -49.0 53093.745 53093.745
3 100002 1 1038818.0 1.0 12.0 -235.0 -250.0 9251.775 9251.775
4 100002 1 1038818.0 1.0 9.0 -325.0 -344.0 9251.775 9251.775

Correlation of TARGET feature with installments_payments dataset¶

In [ ]:
merged_df.corr()['TARGET'].abs().sort_values(ascending= False)
Out[ ]:
TARGET                    1.000000
DAYS_ENTRY_PAYMENT        0.035122
DAYS_INSTALMENT           0.034974
NUM_INSTALMENT_NUMBER     0.016190
NUM_INSTALMENT_VERSION    0.009896
AMT_PAYMENT               0.003623
SK_ID_CURR                0.002533
AMT_INSTALMENT            0.001498
SK_ID_PREV                0.000212
Name: TARGET, dtype: float64

Heatmap of installments_payments dataset with TARGET feature¶

In [ ]:
fig = plt.figure(figsize=(12,10))
sns.heatmap(merged_df.corr(), annot=True)
Out[ ]:
<Axes: >

Observation:¶

  • There is no strong correlation between the TARGET column and any of the other columns in the merged dataset of application_train and installments_payments files.
  • The NUM_INSTALMENT_NUMBER and NUM_INSTALMENT_VERSION columns also have weak correlations with the TARGET column, which indicates that borrowers who have more installment payments or installment versions are slightly more likely to have difficulty repaying the loan.
  • However, these correlations are relatively weak compared to the DAYS_ENTRY_PAYMENT and DAYS_INSTALMENT columns. The rest of the columns have either no or very weak correlation with the TARGET column.

Dataset questions¶

Unique record for each SK_ID_CURR¶

In [ ]:
list(datasets.keys())
Out[ ]:
['application_train',
 'application_test',
 'bureau',
 'bureau_balance',
 'credit_card_balance',
 'installments_payments',
 'previous_application',
 'POS_CASH_balance']
In [ ]:
len(datasets["application_train"]["SK_ID_CURR"].unique()) == datasets["application_train"].shape[0]
Out[ ]:
True
In [ ]:
# is there an overlap between the test and train customers 
np.intersect1d(datasets["application_train"]["SK_ID_CURR"], datasets["application_test"]["SK_ID_CURR"])
Out[ ]:
array([], dtype=int64)
In [ ]:
# 
datasets["application_test"].shape
Out[ ]:
(48744, 118)
In [ ]:
datasets["application_train"].shape
Out[ ]:
(307511, 119)

previous applications for the submission file¶

The persons in the kaggle submission file have had previous applications in the previous_application.csv. 47,800 out 48,744 people have had previous appications.

In [ ]:
appsDF = datasets["previous_application"]
display(appsDF.head())
print(f"{appsDF.shape[0]:,} rows, {appsDF.shape[1]:,} columns")
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START FLAG_LAST_APPL_PER_CONTRACT NFLAG_LAST_APPL_IN_DAY RATE_DOWN_PAYMENT RATE_INTEREST_PRIMARY RATE_INTEREST_PRIVILEGED NAME_CASH_LOAN_PURPOSE NAME_CONTRACT_STATUS DAYS_DECISION NAME_PAYMENT_TYPE CODE_REJECT_REASON NAME_TYPE_SUITE NAME_CLIENT_TYPE NAME_GOODS_CATEGORY NAME_PORTFOLIO NAME_PRODUCT_TYPE CHANNEL_TYPE SELLERPLACE_AREA NAME_SELLER_INDUSTRY CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION DAYS_FIRST_DRAWING DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION NFLAG_INSURED_ON_APPROVAL
0 2030495 271877 Consumer loans 1730.430 17145.0 17145.0 0.0 17145.0 SATURDAY 15 Y 1 0.0 0.182832 0.867336 XAP Approved -73 Cash through the bank XAP NaN Repeater Mobile POS XNA Country-wide 35 Connectivity 12.0 middle POS mobile with interest 365243.0 -42.0 300.0 -42.0 -37.0 0.0
1 2802425 108129 Cash loans 25188.615 607500.0 679671.0 NaN 607500.0 THURSDAY 11 Y 1 NaN NaN NaN XNA Approved -164 XNA XAP Unaccompanied Repeater XNA Cash x-sell Contact center -1 XNA 36.0 low_action Cash X-Sell: low 365243.0 -134.0 916.0 365243.0 365243.0 1.0
2 2523466 122040 Cash loans 15060.735 112500.0 136444.5 NaN 112500.0 TUESDAY 11 Y 1 NaN NaN NaN XNA Approved -301 Cash through the bank XAP Spouse, partner Repeater XNA Cash x-sell Credit and cash offices -1 XNA 12.0 high Cash X-Sell: high 365243.0 -271.0 59.0 365243.0 365243.0 1.0
3 2819243 176158 Cash loans 47041.335 450000.0 470790.0 NaN 450000.0 MONDAY 7 Y 1 NaN NaN NaN XNA Approved -512 Cash through the bank XAP NaN Repeater XNA Cash x-sell Credit and cash offices -1 XNA 12.0 middle Cash X-Sell: middle 365243.0 -482.0 -152.0 -182.0 -177.0 1.0
4 1784265 202054 Cash loans 31924.395 337500.0 404055.0 NaN 337500.0 THURSDAY 9 Y 1 NaN NaN NaN Repairs Refused -781 Cash through the bank HC NaN Repeater XNA Cash walk-in Credit and cash offices -1 XNA 24.0 high Cash Street: high NaN NaN NaN NaN NaN NaN
1,670,214 rows, 37 columns
In [ ]:
print(f"There are  {appsDF.shape[0]:,} previous applications")
There are  1,670,214 previous applications
In [ ]:
#Find the intersection of two arrays.
print(f'Number of train applicants with previous applications is {len(np.intersect1d(datasets["previous_application"]["SK_ID_CURR"], datasets["application_train"]["SK_ID_CURR"])):,}')
Number of train applicants with previous applications is 291,057
In [ ]:
#Find the intersection of two arrays.
print(f'Number of train applicants with previous applications is {len(np.intersect1d(datasets["previous_application"]["SK_ID_CURR"], datasets["application_test"]["SK_ID_CURR"])):,}')
Number of train applicants with previous applications is 47,800
In [ ]:
# How many previous applciations  per applicant in the previous_application 
prevAppCounts = appsDF['SK_ID_CURR'].value_counts(dropna=False)
len(prevAppCounts[prevAppCounts >40])  #more that 40 previous applications
plt.hist(prevAppCounts[prevAppCounts>=0], bins=100)
plt.grid()
In [ ]:
prevAppCounts[prevAppCounts >50].plot(kind='bar')
plt.xticks(rotation=90)
plt.show()

Histogram of Number of previous applications for an ID¶

In [ ]:
sum(appsDF['SK_ID_CURR'].value_counts()==1)
Out[ ]:
60458
In [ ]:
plt.hist(appsDF['SK_ID_CURR'].value_counts(), cumulative =True, bins = 100);
plt.grid()
plt.ylabel('cumulative number of IDs')
plt.xlabel('Number of previous applications per ID')
plt.title('Histogram of Number of previous applications for an ID')
Out[ ]:
Text(0.5, 1.0, 'Histogram of Number of previous applications for an ID')
Can we differentiate applications by low, medium and high previous apps?¶
* Low = <5 claims (22%)
* Medium = 10 to 39 claims (58%)
* High = 40 or more claims (20%)
In [ ]:
apps_all = appsDF['SK_ID_CURR'].nunique()
apps_5plus = appsDF['SK_ID_CURR'].value_counts()>=5
apps_40plus = appsDF['SK_ID_CURR'].value_counts()>=40
print('Percentage with 10 or more previous apps:', np.round(100.*(sum(apps_5plus)/apps_all),5))
print('Percentage with 40 or more previous apps:', np.round(100.*(sum(apps_40plus)/apps_all),5))
Percentage with 10 or more previous apps: 41.76895
Percentage with 40 or more previous apps: 0.03453

Joining secondary tables with the primary table¶

In the case of the HCDR competition (and many other machine learning problems that involve multiple tables in 3NF or not) we need to join these datasets (denormalize) when using a machine learning pipeline. Joining the secondary tables with the primary table will lead to lots of new features about each loan application; these features will tend to be aggregate type features or meta data about the loan or its application. How can we do this when using Machine Learning Pipelines?

Joining previous_application with application_x¶

We refer to the application_train data (and also application_test data also) as the primary table and the other files as the secondary tables (e.g., previous_application dataset). All tables can be joined using the primary key SK_ID_PREV.

Let's assume we wish to generate a feature based on previous application attempts. In this case, possible features here could be:

  • A simple feature could be the number of previous applications.
  • Other summary features of original features such as AMT_APPLICATION, AMT_CREDIT could be based on average, min, max, median, etc.

To build such features, we need to join the application_train data (and also application_test data also) with the 'previous_application' dataset (and the other available datasets).

When joining this data in the context of pipelines, different strategies come to mind with various tradeoffs:

  1. Preprocess each of the non-application data sets, thereby generating many new (derived) features, and then joining (aka merge) the results with the application_train data (the labeled dataset) and with the application_test data (the unlabeled submission dataset) prior to processing the data (in a train, valid, test partition) via your machine learning pipeline. [This approach is recommended for this HCDR competition. WHY?]
  • Do the joins as part of the transformation steps. [Not recommended here. WHY?]. How can this be done? Will it work?
    • This would be necessary if we had dataset wide features such as IDF (inverse document frequency) which depend on the entire subset of data as opposed to a single loan application (e.g., a feature about the relative amount applied for such as the percentile of the loan amount being applied for).

I want you to think about this section and build on this.

Roadmap for secondary table processing¶

  1. Transform all the secondary tables to features that can be joined into the main table the application table (labeled and unlabeled)
    • 'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments',
    • 'previous_application', 'POS_CASH_balance'
  • Merge the transformed secondary tables with the primary tables (i.e., the application_train data (the labeled dataset) and with the application_test data (the unlabeled submission dataset)), thereby leading to X_train, y_train, X_valid, etc.
  • Proceed with the learning pipeline using X_train, y_train, X_valid, etc.
  • Generate a submission file using the learnt model
In [ ]:
appsDF[0:50][(appsDF["SK_ID_CURR"]==175704)]
Out[ ]:
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START FLAG_LAST_APPL_PER_CONTRACT NFLAG_LAST_APPL_IN_DAY RATE_DOWN_PAYMENT RATE_INTEREST_PRIMARY RATE_INTEREST_PRIVILEGED NAME_CASH_LOAN_PURPOSE NAME_CONTRACT_STATUS DAYS_DECISION NAME_PAYMENT_TYPE CODE_REJECT_REASON NAME_TYPE_SUITE NAME_CLIENT_TYPE NAME_GOODS_CATEGORY NAME_PORTFOLIO NAME_PRODUCT_TYPE CHANNEL_TYPE SELLERPLACE_AREA NAME_SELLER_INDUSTRY CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION DAYS_FIRST_DRAWING DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION NFLAG_INSURED_ON_APPROVAL
6 2315218 175704 Cash loans NaN 0.0 0.0 NaN NaN TUESDAY 11 Y 1 NaN NaN NaN XNA Canceled -14 XNA XAP NaN Repeater XNA XNA XNA Credit and cash offices -1 XNA NaN XNA Cash NaN NaN NaN NaN NaN NaN
In [ ]:
appsDF[0:50][(appsDF["SK_ID_CURR"]==175704)]["AMT_CREDIT"]
Out[ ]:
6    0.0
Name: AMT_CREDIT, dtype: float64
In [ ]:
appsDF[0:50][(appsDF["SK_ID_CURR"]==175704) & ~(appsDF["AMT_CREDIT"]==1.0)]
Out[ ]:
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START FLAG_LAST_APPL_PER_CONTRACT NFLAG_LAST_APPL_IN_DAY RATE_DOWN_PAYMENT RATE_INTEREST_PRIMARY RATE_INTEREST_PRIVILEGED NAME_CASH_LOAN_PURPOSE NAME_CONTRACT_STATUS DAYS_DECISION NAME_PAYMENT_TYPE CODE_REJECT_REASON NAME_TYPE_SUITE NAME_CLIENT_TYPE NAME_GOODS_CATEGORY NAME_PORTFOLIO NAME_PRODUCT_TYPE CHANNEL_TYPE SELLERPLACE_AREA NAME_SELLER_INDUSTRY CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION DAYS_FIRST_DRAWING DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION NFLAG_INSURED_ON_APPROVAL
6 2315218 175704 Cash loans NaN 0.0 0.0 NaN NaN TUESDAY 11 Y 1 NaN NaN NaN XNA Canceled -14 XNA XAP NaN Repeater XNA XNA XNA Credit and cash offices -1 XNA NaN XNA Cash NaN NaN NaN NaN NaN NaN

Feature engineering¶

In [ ]:
#appsDF['agg_op_features'].head()

The groupby output will have an index or multi-index on rows corresponding to your chosen grouping variables. To avoid setting this index, pass “as_index=False” to the groupby operation.

import pandas as pd
import dateutil

# Load data from csv file
data = pd.DataFrame.from_csv('phone_data.csv')
# Convert date from string to date times
data['date'] = data['date'].apply(dateutil.parser.parse, dayfirst=True)

data.groupby('month', as_index=False).agg({"duration": "sum"})

Pandas reset_index() to convert Multi-Index to Columns We can simplify the multi-index dataframe using reset_index() function in Pandas. By default, Pandas reset_index() converts the indices to columns.

Fixing Column names after Pandas agg() function to summarize grouped data¶

Since we have both the variable name and the operation performed in two rows in the Multi-Index dataframe, we can use that and name our new columns correctly.

For more details unstacking groupby results and examples please see here

For more details and examples please see here

In [ ]:
features = ['AMT_ANNUITY', 'AMT_APPLICATION']
print(f"{appsDF[features].describe()}")
agg_ops = ["min", "max", "mean"]
result = appsDF.groupby(["SK_ID_CURR"], as_index=False).agg("mean") #group by ID
display(result.head())
print("-"*50)
result = appsDF.groupby(["SK_ID_CURR"], as_index=False).agg({'AMT_ANNUITY' : agg_ops, 'AMT_APPLICATION' : agg_ops})
result.columns = result.columns.map('_'.join)
display(result)
result['range_AMT_APPLICATION'] = result['AMT_APPLICATION_max'] - result['AMT_APPLICATION_min']
print(f"result.shape: {result.shape}")
result[0:10]
        AMT_ANNUITY  AMT_APPLICATION
count  1.297979e+06     1.670214e+06
mean   1.595512e+04     1.752339e+05
std    1.478214e+04     2.927798e+05
min    0.000000e+00     0.000000e+00
25%    6.321780e+03     1.872000e+04
50%    1.125000e+04     7.104600e+04
75%    2.065842e+04     1.803600e+05
max    4.180581e+05     6.905160e+06
SK_ID_CURR SK_ID_PREV AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE HOUR_APPR_PROCESS_START NFLAG_LAST_APPL_IN_DAY RATE_DOWN_PAYMENT RATE_INTEREST_PRIMARY RATE_INTEREST_PRIVILEGED DAYS_DECISION SELLERPLACE_AREA CNT_PAYMENT DAYS_FIRST_DRAWING DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION NFLAG_INSURED_ON_APPROVAL
0 100001 1.369693e+06 3951.000 24835.50 23787.00 2520.0 24835.5 13.000000 1.0 0.104326 NaN NaN -1740.0 23.0 8.0 365243.0 -1709.000000 -1499.000000 -1619.000000 -1612.000000 0.000000
1 100002 1.038818e+06 9251.775 179055.00 179055.00 0.0 179055.0 9.000000 1.0 0.000000 NaN NaN -606.0 500.0 24.0 365243.0 -565.000000 125.000000 -25.000000 -17.000000 0.000000
2 100003 2.281150e+06 56553.990 435436.50 484191.00 3442.5 435436.5 14.666667 1.0 0.050030 NaN NaN -1305.0 533.0 10.0 365243.0 -1274.333333 -1004.333333 -1054.333333 -1047.333333 0.666667
3 100004 1.564014e+06 5357.250 24282.00 20106.00 4860.0 24282.0 5.000000 1.0 0.212008 NaN NaN -815.0 30.0 4.0 365243.0 -784.000000 -694.000000 -724.000000 -714.000000 0.000000
4 100005 2.176837e+06 4813.200 22308.75 20076.75 4464.0 44617.5 10.500000 1.0 0.108964 NaN NaN -536.0 18.0 12.0 365243.0 -706.000000 -376.000000 -466.000000 -460.000000 0.000000
--------------------------------------------------
SK_ID_CURR_ AMT_ANNUITY_min AMT_ANNUITY_max AMT_ANNUITY_mean AMT_APPLICATION_min AMT_APPLICATION_max AMT_APPLICATION_mean
0 100001 3951.000 3951.000 3951.000000 24835.5 24835.5 24835.500
1 100002 9251.775 9251.775 9251.775000 179055.0 179055.0 179055.000
2 100003 6737.310 98356.995 56553.990000 68809.5 900000.0 435436.500
3 100004 5357.250 5357.250 5357.250000 24282.0 24282.0 24282.000
4 100005 4813.200 4813.200 4813.200000 0.0 44617.5 22308.750
... ... ... ... ... ... ... ...
338852 456251 6605.910 6605.910 6605.910000 40455.0 40455.0 40455.000
338853 456252 10074.465 10074.465 10074.465000 57595.5 57595.5 57595.500
338854 456253 3973.095 5567.715 4770.405000 19413.0 28912.5 24162.750
338855 456254 2296.440 19065.825 10681.132500 18846.0 223789.5 121317.750
338856 456255 2250.000 54022.140 20775.391875 45000.0 1170000.0 362770.875

338857 rows × 7 columns

result.shape: (338857, 8)
Out[ ]:
SK_ID_CURR_ AMT_ANNUITY_min AMT_ANNUITY_max AMT_ANNUITY_mean AMT_APPLICATION_min AMT_APPLICATION_max AMT_APPLICATION_mean range_AMT_APPLICATION
0 100001 3951.000 3951.000 3951.000000 24835.5 24835.5 24835.500000 0.0
1 100002 9251.775 9251.775 9251.775000 179055.0 179055.0 179055.000000 0.0
2 100003 6737.310 98356.995 56553.990000 68809.5 900000.0 435436.500000 831190.5
3 100004 5357.250 5357.250 5357.250000 24282.0 24282.0 24282.000000 0.0
4 100005 4813.200 4813.200 4813.200000 0.0 44617.5 22308.750000 44617.5
5 100006 2482.920 39954.510 23651.175000 0.0 688500.0 272203.260000 688500.0
6 100007 1834.290 22678.785 12278.805000 17176.5 247500.0 150530.250000 230323.5
7 100008 8019.090 25309.575 15839.696250 0.0 450000.0 155701.800000 450000.0
8 100009 7435.845 17341.605 10051.412143 40455.0 110160.0 76741.714286 69705.0
9 100010 27463.410 27463.410 27463.410000 247212.0 247212.0 247212.000000 0.0

feature transformer¶

In [ ]:
# Create aggregate features (via pipeline)
class FeaturesAggregater(BaseEstimator, TransformerMixin):
    def __init__(self, features=None): # no *args or **kargs
        self.features = features
        self.agg_op_features = {}
        for f in features:
            #self.agg_op_features[f] = {f"{f}_{func}":func for func in ["min", "max", "mean"]}
            self.agg_op_features[f]=["min", "max", "mean","median"]

    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        #from IPython.core.debugger import Pdb as pdb;    pdb().set_trace() #breakpoint; dont forget to quit         
        result = X.groupby(["SK_ID_CURR"]).agg(self.agg_op_features)
        #result.columns = result.columns.droplevel()
        result.columns = ["_".join(x) for x in result.columns.ravel()]
        result = result.reset_index(level=["SK_ID_CURR"])
        return result # return dataframe with the join key "SK_ID_CURR"

Feature Selection¶

The features selected for analysis based on their higher correlation with the TARGET column in different datasets like bureau, credit_card_balance, previous_application, POS_CASH_balance, and installments_payments. These features provide insights into the borrower's payment behavior, creditworthiness, payment patterns, and repayment risk, which can aid in making informed decisions regarding loan approvals, risk assessments, and borrower management.

  • We selected DAYS_CREDIT, DAYS_CREDIT_UPDATE, and DAYS_ENDDATE_FACT features from the bureau dataset. The correlation analysis indicates that borrowers who have longer periods since the last credit inquiry (DAYS_CREDIT), or longer periods since the last update of credit-related information (DAYS_CREDIT_UPDATE), or longer periods since the end of the credit line (DAYS_ENDDATE_FACT) are slightly more likely to have difficulty repaying the loan.

  • Similarly, we selected MONTHS_BALANCE, AMT_BALANCE, and CNT_DRAWINGS_ATM_CURRENT features from credit_card_balance. The correlation analysis suggests that borrowers who have a higher balance (AMT_BALANCE) or more ATM withdrawals (CNT_DRAWINGS_ATM_CURRENT) are slightly more likely to have difficulty in repaying the loan. The MONTHS_BALANCE feature indicates the number of months between the current month and the month when the credit card balance was calculated, which could also impact the loan repayment ability of the borrower.

  • We selected AMT_CREDIT and CNT_PAYMENT features from previous_application. The AMT_CREDIT feature provides information on the total amount of credit requested by the borrower for the previous application, which can help in assessing the borrower's creditworthiness and repayment capacity. The CNT_PAYMENT feature indicates the number of payments that the borrower has committed to make, which can help in predicting the borrower's payment behavior and loan repayment risk. Analyzing the correlation between these features and the TARGET column can provide insights into the relationship between the borrower's creditworthiness, payment behavior, and loan repayment difficulty. This can aid in making informed decisions regarding loan approvals, risk assessments, and borrower management.

  • For analysis of the POS_CASH_balance dataset, we selected the features MONTHS_BALANCE, CNT_INSTALMENT_FUTURE, and CNT_INSTALMENT. The correlation analysis suggests that borrowers with a higher number of future installment payments (CNT_INSTALMENT_FUTURE) are slightly more likely to have difficulty repaying the loan. The MONTHS_BALANCE feature provides insight into the length of time the borrower has been making installment payments, which can be used to understand payment behavior and repayment risk. The CNT_INSTALMENT feature gives information about the number of installments in the payment plan, which can be used to identify payment patterns and potential risk factors.

  • We selected DAYS_ENTRY_PAYMENT, DAYS_INSTALMENT, and AMT_PAYMENT features from installments_payments. The correlation analysis shows that borrowers who pay later (DAYS_ENTRY_PAYMENT) or have longer installment periods (DAYS_INSTALMENT) are slightly more likely to have difficulty repaying the loan. AMT_PAYMENT helps to understand the relationship between borrower payment behavior and loan repayment difficulty. Additionally, it can be used to calculate other crucial features, such as late payment or underpayment amounts, providing insights into borrower payment behavior and loan repayment risk.

In [ ]:
_#LOAD THE DATASET
X_train= datasets["application_train"]#primary dataset
appsDF = datasets["previous_application"] #prev app
burDF=datasets['bureau']
bureau_balDF=datasets['bureau_balance']
ccbDF=datasets['credit_card_balance']
ipDF=datasets['installments_payments']
posDF=datasets['POS_CASH_balance']

#SELECTED FEATURES ARE TAKEN FROM SECONDARY TABLES BASED ON CORRELATION ANALAYSIS
pr_features = ['AMT_CREDIT','CNT_PAYMENT']
br_features = ['DAYS_CREDIT', 'DAYS_CREDIT_UPDATE','DAYS_ENDDATE_FACT']
cc_features = ['MONTHS_BALANCE', 'AMT_BALANCE', 'CNT_DRAWINGS_ATM_CURRENT']
ip_features = ['DAYS_ENTRY_PAYMENT','DAYS_INSTALMENT', 'AMT_PAYMENT']
pos_features = ['MONTHS_BALANCE','CNT_INSTALMENT_FUTURE', 'CNT_INSTALMENT']
In [ ]:
#PREVIOUS APPLICATION TRANSFORMATION


prevApps_feature_pipeline = Pipeline([
        ('prevApps_aggregater', FeaturesAggregater(pr_features)), # Aggregate across old and new features
    ])
prevApps_aggregated = prevApps_feature_pipeline.fit_transform(appsDF)


#Bureau Transformation


bureau_feature_pipeline = Pipeline([
        ('bureau_aggregater', FeaturesAggregater(br_features)), # Aggregate across old and new features
    ])
bureau_aggregated = bureau_feature_pipeline.fit_transform(burDF)

#Credit Card Balance Transformation


ccbal_feature_pipeline = Pipeline([
        ('ccbal_aggregater', FeaturesAggregater(cc_features)), # Aggregate across old and new features
    ])
ccbal_aggregated = ccbal_feature_pipeline.fit_transform(ccbDF)

#Installments Payments Transformation


inspay_feature_pipeline = Pipeline([
        ('inspay_aggregater', FeaturesAggregater(ip_features)), # Aggregate across old and new features
    ])
inspay_aggregated = inspay_feature_pipeline.fit_transform(ipDF)

#POS CASH Balance Transformation


poscashbal_feature_pipeline = Pipeline([
        ('poscashbal_aggregater', FeaturesAggregater(pos_features)), # Aggregate across old and new features
    ])
poscashbal_aggregated = poscashbal_feature_pipeline.fit_transform(posDF)

Join the labeled dataset¶

In [ ]:
# merge primary table and secondary tables using features based on meta data and  aggregage stats 
merged_data=True

if merged_data:
    # 1. Join/Merge in prevApps Data
    X_train = X_train.merge(prevApps_aggregated, how='left', on='SK_ID_CURR')

    # 2. Join/Merge in Bureau Data
    X_train = X_train.merge(bureau_aggregated, how='left', on="SK_ID_CURR")

    # 3. Join/Merge in credit_card_balance  Data
    X_train = X_train.merge(ccbal_aggregated, how='left', on="SK_ID_CURR")

    # 4. Join/Merge in Aggregated installments_payments Data
    X_train = X_train.merge(inspay_aggregated, how='left', on="SK_ID_CURR")

    # 5. Join/Merge in Aggregated POS_CASH_balance Data
    X_train = X_train.merge(poscashbal_aggregated, how='left', on="SK_ID_CURR")

Join the unlabeled dataset (i.e., the submission file)¶

In [ ]:
if merged_data:
  X_kaggle_test= datasets["application_test"]
  X_kaggle_test = X_kaggle_test.merge(prevApps_aggregated, how='left', on='SK_ID_CURR')
  X_kaggle_test = X_kaggle_test.merge(bureau_aggregated, how='left', on='SK_ID_CURR')
  X_kaggle_test = X_kaggle_test.merge(ccbal_aggregated, how='left', on='SK_ID_CURR')
  X_kaggle_test = X_kaggle_test.merge(inspay_aggregated, how='left', on='SK_ID_CURR')
  X_kaggle_test = X_kaggle_test.merge(poscashbal_aggregated, how='left', on='SK_ID_CURR')

   
In [ ]:
# Convert categorical features to numerical approximations (via pipeline)
class ClaimAttributesAdder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None): 
        charlson_idx_dt = {'0': 0, '1-2': 2, '3-4': 4, '5+': 6}
        los_dt = {'1 day': 1, '2 days': 2, '3 days': 3, '4 days': 4, '5 days': 5, '6 days': 6,
          '1- 2 weeks': 11, '2- 4 weeks': 21, '4- 8 weeks': 42, '26+ weeks': 180}
        X['PayDelay'] = X['PayDelay'].apply(lambda x: int(x) if x != '162+' else int(162))
        X['DSFS'] = X['DSFS'].apply(lambda x: None if pd.isnull(x) else int(x[0]) + 1)
        X['CharlsonIndex'] = X['CharlsonIndex'].apply(lambda x: charlson_idx_dt[x])
        X['LengthOfStay'] = X['LengthOfStay'].apply(lambda x: None if pd.isnull(x) else los_dt[x])
        return X
    
    

Other Feature Transformation¶

In [ ]:
X_train['INCOME_CREDIT_PERCENT'] = (X_train.AMT_INCOME_TOTAL / X_train.AMT_CREDIT).replace(np.inf, 0)  
X_train['REPAYMENT_PERCENT'] = (X_train.CNT_INSTALMENT_FUTURE_max / X_train.CNT_INSTALMENT_min).replace(np.inf, 0)
X_train['ANN_INC_PERCENT'] = (X_train.AMT_ANNUITY / X_train.AMT_INCOME_TOTAL).replace(np.inf, 0)
In [ ]:
X_kaggle_test['INCOME_CREDIT_PERCENT'] = (X_kaggle_test.AMT_INCOME_TOTAL / X_kaggle_test.AMT_CREDIT).replace(np.inf, 0)  
X_kaggle_test['REPAYMENT_PERCENT'] = (X_kaggle_test.CNT_INSTALMENT_FUTURE_max / X_kaggle_test.CNT_INSTALMENT_min).replace(np.inf, 0)
X_kaggle_test['ANN_INC_PERCENT'] = (X_kaggle_test.AMT_ANNUITY / X_kaggle_test.AMT_INCOME_TOTAL).replace(np.inf, 0)
  • This new column, named 'INCOME_CREDIT_PERCENT', will give us a measure of how much of the applicant's total income is being used to pay off the loan. A high percentage may indicate that the applicant is taking on too much debt relative to their income, which may increase the risk of default. Similarly, a low percentage may indicate that the applicant has a strong financial position and can comfortably repay the loan. Lenders often use this ratio to assess the borrower's creditworthiness and to set borrowing limits.

  • This new column, named 'REPAYMENT_PERCENT', can be used as a measure of the applicant's repayment history. If the ratio is high, it indicates that the applicant is able to pay off their loans quickly and is therefore less likely to default. However, if the ratio is low, it may indicate that the applicant struggles to pay off their loans on time, potentially indicating a higher risk of default.

  • This new column, named 'ANN_INC_PERCENT', is an important metric for assessing the applicant's financial stability and their ability to make timely payments, as it represents the percentage of their income that is being used to make annuity payments. A higher value for this ratio indicates a greater proportion of income being used towards the annuity payments, which may be a potential red flag for lenders as it increases the risk of default. On the other hand, a lower value for this ratio suggests that the applicant has more disposable income available for other expenses, and hence, may be less likely to default on their loan. Therefore, this new column 'ANN_INC_PERCENT' serves as an important feature for lenders to evaluate the creditworthiness of an applicant.

Handling some Class Imbalance¶

There are several techniques that handles class imbalance thing. This involves either oversampling the minority class or undersampling the majority class to balance the number of samples in each class. Oversampling can be done by replicating the minority class samples, while undersampling involves randomly removing samples from the majority class. There are several libraries in Python like imblearn that offer different techniques of resampling such as RandomOverSampler, RandomUnderSampler, SMOTE, etc. We tried both of them but difference is so huge that it's not feasible to do any because in one it's affecting training accuracy as during undersampling information gets lost and in oversampling, Server crashes as it was leading to many rows. So we decided to remove rows of majority class with more than 40% missing value.

In [ ]:
mask = (X_train['TARGET'] == 0) & (X_train.isna().mean(axis=1) >= 0.4)

# drop rows that satisfy the condition
X_train = X_train.drop(X_train[mask].index)
In [ ]:
X_train.shape
Out[ ]:
(287427, 178)

Rows went down from 307511 to 286483

DROP COLUMNS WITH MAJOR MISSING VALUES¶

In [ ]:
from copy import deepcopy
In [ ]:
percent = (X_train.isnull().sum()/X_train.isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = X_train.isna().sum().sort_values(ascending = False)
total_count_with = X_train.isnull().count()
total_count_without = X_train.count()
missing_application_train_data  = pd.concat([percent, sum_missing, total_count_without, total_count_with], axis=1, keys=['Percent', 'Missing Count', 'Total Count without Missing value', 'Total Count'])
missing_application_train_data
Out[ ]:
Percent Missing Count Total Count without Missing value Total Count
CNT_DRAWINGS_ATM_CURRENT_max 78.73 225559 60924 286483
CNT_DRAWINGS_ATM_CURRENT_mean 78.73 225559 60924 286483
CNT_DRAWINGS_ATM_CURRENT_median 78.73 225559 60924 286483
CNT_DRAWINGS_ATM_CURRENT_min 78.73 225559 60924 286483
MONTHS_BALANCE_median_x 69.86 200129 86354 286483
... ... ... ... ...
REG_CITY_NOT_WORK_CITY 0.00 0 286483 286483
REG_CITY_NOT_LIVE_CITY 0.00 0 286483 286483
LIVE_REGION_NOT_WORK_REGION 0.00 0 286483 286483
REG_REGION_NOT_WORK_REGION 0.00 0 286483 286483
SK_ID_CURR 0.00 0 286483 286483

181 rows × 4 columns

In [ ]:
X_train_missing_data = pd.DataFrame((X_train.isnull().sum())/X_train.shape[0])
X_train_missing_col = X_train_missing_data[X_train_missing_data[0] > 0.6].index
X_train = X_train[[col for col in X_train.columns if col not in X_train_missing_col]]
X_train.shape
Out[ ]:
(286483, 152)

Drop Columns using Coorelation¶

In [ ]:
#ref: https://stackoverflow.com/questions/29294983/how-to-calculate-correlation-between-all-columns-and-remove-highly-correlated-on

X_train_corr_matrix = X_train.corr().abs()

# Select upper triangle of correlation matrix
upper = X_train_corr_matrix.where(np.triu(np.ones(X_train_corr_matrix.shape), k=1).astype(bool))

# Find features with correlation greater than 0.85
to_drop = [column for column in upper.columns if any(upper[column] > 0.85)]

# Drop features
X_train.drop(to_drop, axis=1, inplace=True)

X_train.shape
Out[ ]:
(286483, 107)
In [ ]:
X_train2=deepcopy(X_train)
In [ ]:
X_train=deepcopy(X_train2)

Processing pipeline¶

Please this blog for more details of OHE when the validation/test have previously unseen unique values.

HCDR preprocessing¶

In [ ]:
# Split the provided training data into training and validationa and test
# The kaggle evaluation test set has no labels
#
from sklearn.model_selection import train_test_split

use_application_data_ONLY = False #use joined data
if use_application_data_ONLY:
    # just selected a few features for a baseline experiment
    selected_features1 = [col for col in datasets["application_train"].columns if col != 'TARGET']

    selected_features2= selected_features1+br_features+pr_features+cc_features+ip_features+pos_features+['NAME_CONTRACT_TYPE']
    
    selected_features3 = ['AMT_INCOME_TOTAL',  'AMT_CREDIT','DAYS_EMPLOYED','DAYS_BIRTH','EXT_SOURCE_1',
        'EXT_SOURCE_2','EXT_SOURCE_3','CODE_GENDER', 'FLAG_OWN_REALTY','FLAG_OWN_CAR','NAME_CONTRACT_TYPE', 
                   'NAME_EDUCATION_TYPE','OCCUPATION_TYPE','NAME_INCOME_TYPE']
    
    X_train = datasets["application_train"][selected_features2]
    y_train = datasets["application_train"]['TARGET']
    X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.15, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.15, random_state=42)
    X_kaggle_test= datasets["application_test"][selected_features2]
    # y_test = datasets["application_test"]['TARGET']   #why no  TARGET?!! (hint: kaggle competition)

else: 
    y_train = X_train['TARGET']
    selected_features1 = [col for col in X_train.columns if col not in ['TARGET','SK_ID_CURR']]
    selected_features = ['AMT_INCOME_TOTAL',  'AMT_CREDIT','DAYS_EMPLOYED','DAYS_BIRTH','EXT_SOURCE_1',
            'EXT_SOURCE_2','EXT_SOURCE_3','CODE_GENDER', 'FLAG_OWN_REALTY','FLAG_OWN_CAR','NAME_CONTRACT_TYPE', 
                      'NAME_EDUCATION_TYPE','OCCUPATION_TYPE','NAME_INCOME_TYPE']
    X_train = X_train[selected_features1]
    X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, stratify=y_train,test_size=0.15, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, stratify=y_train, test_size=0.15, random_state=42)
    X_kaggle_test= X_kaggle_test[selected_features1]
# y_test = datasets["application_test"]['TARGET']   #why no  TARGET?!! (hint: kaggle competition)

    
print(f"X train           shape: {X_train.shape}")
print(f"X validation      shape: {X_valid.shape}")
print(f"X test            shape: {X_test.shape}")
print(f"X X_kaggle_test   shape: {X_kaggle_test.shape}")
X train           shape: (206983, 105)
X validation      shape: (42973, 105)
X test            shape: (36527, 105)
X X_kaggle_test   shape: (48744, 105)

Feature Importance¶

In [ ]:
X_train1=deepcopy(X_train)
In [ ]:
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from datetime import datetime
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
import time
In [ ]:
from xgboost import XGBClassifier
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.naive_bayes import GaussianNB
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix

from time import time
import xgboost
from xgboost import plot_importance
In [ ]:
col=X_train1.columns
In [ ]:
col
In [ ]:
num_attribs1 = X_train1.select_dtypes(include=['int64', 'float64']).columns
imputer = SimpleImputer(strategy='mean')
X_train1[num_attribs1] = imputer.fit_transform(X_train1[num_attribs1])

cat_attribs1 = X_train1.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
for col in cat_attribs1: 
    imputer = SimpleImputer(strategy='constant', fill_value='missing')
    X_train1[col] = imputer.fit_transform(X_train1[[col]])
    X_train1[col] = label_encoder.fit_transform(X_train1[col])

# Apply standard scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train1)
In [ ]:
xgb_model = XGBClassifier()
xgb_model.fit(X_train_scaled, y_train)

# Calculate feature importances
importances = xgb_model.feature_importances_
In [ ]:
col=X_train1.columns
# create a DataFrame of feature importances
importance_df = pd.DataFrame({
    "Feature": col,
    "Importance": importances
})

# sort the DataFrame by feature importance in descending order
importance_df = importance_df.sort_values("Importance", ascending=False)
importance_df
Out[ ]:
Feature Importance
80 CNT_PAYMENT_min 0.071660
46 EMERGENCYSTATE_MODE 0.054275
70 AMT_REQ_CREDIT_BUREAU_HOUR 0.049927
71 AMT_REQ_CREDIT_BUREAU_DAY 0.048500
35 EXT_SOURCE_3 0.043497
... ... ...
60 FLAG_DOCUMENT_12 0.000000
58 FLAG_DOCUMENT_10 0.000000
57 FLAG_DOCUMENT_9 0.000000
55 FLAG_DOCUMENT_7 0.000000
52 FLAG_DOCUMENT_4 0.000000

105 rows × 2 columns

In [ ]:
# Sort the feature importances in ascending order
sorted_idx = importances.argsort()

# Create the bar plot
fig, ax = plt.subplots(figsize=(15, 20))
ax.barh(col[sorted_idx], importances[sorted_idx])

# Add annotations to the plot
for i, v in enumerate(importances[sorted_idx]):
    ax.text(v, i, "{:.3f}".format(v), color='blue', fontweight='bold', fontsize=10)

ax.set_xlabel("XGBoost Feature Importance")
plt.show()
In [ ]:
threshold = 0.005
importance_df_filtered = importance_df[importance_df["Importance"] > threshold]
In [ ]:
final_selected_feature=importance_df_filtered['Feature'].values
In [ ]:
X_train_final=X_train[final_selected_feature]
In [ ]:
X_test_final=X_test[final_selected_feature]
In [ ]:
X_valid_final=X_valid[final_selected_feature]
In [ ]:
X_kaggle_test1=deepcopy(X_kaggle_test)
In [ ]:
X_kaggle_test=X_kaggle_test[final_selected_feature]

Save the selected dataframes¶

In [ ]:
X_train_final.to_csv("X_train.csv")
X_test_final.to_csv("X_test.csv")
X_valid_final.to_csv("X_valid.csv")
X_kaggle_test.to_csv("X_kaggle_test.csv")
In [ ]:
y_train.to_csv("y_train.csv")
y_test.to_csv("y_test.csv")
y_valid.to_csv("y_valid.csv")
In [ ]:
# Create a class to select numerical or categorical columns 
# since Scikit-Learn doesn't handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
In [ ]:
numerical_features = X_train_final.select_dtypes(include=[np.number])
categorical_features = X_train_final.select_dtypes(exclude=[np.number])

numerical_features = list(numerical_features.columns)
categorical_features = list(categorical_features.columns)

print("=" * 30)
print("|     Numerical Features     |")
print("=" * 30)
for features in numerical_features:
  print("|  " + features)
print("=" * 30)

print("\n\n")
print("=" * 30)
print("|    Categorical Features    |")
print("=" * 30)
for features in categorical_features:
  print("|  " + features)
print("=" * 30)
In [ ]:
# Identify the numeric features we wish to consider.

num_attribs1= X_train_final.select_dtypes(include=['int64', 'float64']).columns


num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs1)),
        ('imputer', SimpleImputer(strategy='mean')),
        ('std_scaler', StandardScaler()),
    ])

# Identify the categorical features we wish to consider.

cat_attribs1 = X_train_final.select_dtypes(include=['object']).columns

# Notice handle_unknown="ignore" in OHE which ignore values from the validation/test that
# do NOT occur in the training set
cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs1)),
        #('imputer', SimpleImputer(strategy='most_frequent')),
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
    ])

data_prep_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])
              

Baseline Model¶

To get a baseline, we will use some of the features after being preprocessed through the pipeline. The baseline model is a logistic regression model

In [ ]:
from joblib import load
from joblib import dump
In [ ]:
def train_model(model_name,model,description,hyper,saved):
    input_f=str(X_train_final.shape[1])
    model_name=model_name+"-"+input_f+" input features"
    if saved:
      # load the saved model
      pipeline = load(f'{model_name}.joblib')
    else:
        print(model_name)
        pipeline = Pipeline([
            ("preparation", data_prep_pipeline),
            ('model',model)])

        t0 = time()
        
        pipeline.fit(X_train_final, y_train)
        dump(pipeline, f'{model_name}.joblib')

        Training_time = time()-t0

    
    if hyper:
        print("Best parameters")
        best_p=pipeline.named_steps["model"].best_params_
        print(best_p)
    else:
        best_p=pipeline.named_steps["model"].get_params()
        print(best_p)
    

    y_pred_train=pipeline.predict(X_train_final)
    train_accuracy = accuracy_score(y_train, y_pred_train)
    train_r_score=roc_auc_score(y_train, pipeline.predict_proba(X_train_final)[:, 1])

    #for validation set
    y_pred_valid=pipeline.predict(X_valid_final)
    valid_accuracy = accuracy_score(y_valid, y_pred_valid)
    valid_r_score=roc_auc_score(y_valid, pipeline.predict_proba(X_valid_final)[:, 1])
    
    
    
    # Make predictions on test
    t1= time()
    y_pred_test=pipeline.predict(X_test_final)
    y_pred_model=pipeline.predict_proba(X_test_final)[:, 1]
    testing_time=time() - t1
    test_accuracy = accuracy_score(y_test, y_pred_test)
    test_r_score=roc_auc_score(y_test, pipeline.predict_proba(X_test_final)[:, 1])
    
    
    print(classification_report(y_test, y_pred_test))  

    # Create confusion matrix for y_pred_train
    cm_train = confusion_matrix(y_train, y_pred_train)

    # Create confusion matrix for y_pred_test
    cm_test = confusion_matrix(y_test, y_pred_test)


    # Visualize confusion matrix for y_pred_train
    sns.heatmap(cm_train, annot=True, cmap='Blues')
    plt.title('Confusion Matrix (Train)')
    plt.xlabel('Predicted Labels')
    plt.ylabel('True Labels')
    plt.show()

    # Visualize confusion matrix for y_pred_test
    sns.heatmap(cm_test, annot=True, cmap='Blues')
    plt.title('Confusion Matrix (Test)')
    plt.xlabel('Predicted Labels')
    plt.ylabel('True Labels')
    plt.show()




    expLog.loc[len(expLog)] =[f"{model_name}",f"{train_accuracy*100:8.2f}%",f"{valid_accuracy*100:8.2f}%",f"{test_accuracy*100:8.2f}%",\
                              f"{train_r_score*100:8.2f}%",f"{valid_r_score*100:8.2f}%",f"{test_r_score*100:8.2f}%",\
                                    f"{Training_time:8.2f} secs", f"{testing_time:8.2f} secs",
                                    f"{best_p}",description]
    return y_pred_model
In [ ]:
expLog=pd.DataFrame(columns=["exp_name", 
                                   "Train Acc", 
                                   "Valid Acc",
                                   "Test  Acc",
                                   "Train AUC", 
                                   "Valid AUC",
                                   "Test  AUC",
                                   "Train time",
                                   "Test time",
                                   "Best Params",
                                   "Description"
                                  ])
In [ ]:
try:
    expLog
except NameError:
    expLog = pd.DataFrame(columns=["exp_name", 
                                   "Train Acc", 
                                   "Valid Acc",
                                   "Test  Acc",
                                   "Train AUC", 
                                   "Valid AUC",
                                   "Test  AUC",
                                   "Train time",
                                   "Test time",
                                   "Best Params",
                                   "Description"
                                  ])
In [ ]:
expLog

Logistic Regression¶

In [ ]:
%%time 
np.random.seed(42)
lr_baseline_pipeline = Pipeline([
        ("preparation", data_prep_pipeline),
        ("model", LogisticRegression())
    ])
lr_base_model = lr_baseline_pipeline.fit(X_train_final, y_train)
In [ ]:
y_pred_lr=train_model("Baseline Logisitc Regression",LogisticRegression(),"Joined Feature Aggregated in input features",False,False)
In [ ]:
expLog

Evaluation metrics¶

Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.

The SkLearn roc_auc_score function computes the area under the receiver operating characteristic (ROC) curve, which is also denoted by AUC or AUROC. By computing the area under the roc curve, the curve information is summarized in one number.

from sklearn.metrics import roc_auc_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> roc_auc_score(y_true, y_scores)
0.75

Decision Tree¶

In [ ]:
y_pred_dt=train_model("Baseline Decision Tree",DecisionTreeClassifier(random_state=42),"Joined Feature Aggregated in input features",False,False)
In [ ]:
expLog

Random Forest¶

In [ ]:
y_pred_rf=train_model("Baseline Random Forest",RandomForestClassifier(),"Joined Feature Aggregated in input features",False,False)
In [ ]:
expLog

Gaussian Naive Bayes¶

In [ ]:
y_pred_gb=train_model("Baseline GaussianNB",GaussianNB(),"Joined Feature Aggregated in input features",False,False)
In [ ]:
expLog
In [ ]:
 

Compare Baseline Results¶

In [ ]:
from sklearn import metrics
plt.figure(figsize=(8, 8))
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_lr)
auc = round(metrics.roc_auc_score(y_test, y_pred_lr), 4)
plt.plot(fpr,tpr,label="Logistic Regression, AUC="+str(auc))

fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_dt)
auc = round(metrics.roc_auc_score(y_test, y_pred_dt), 4)
plt.plot(fpr,tpr,label="Decision Tree Classifier, AUC="+str(auc))


fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_rf)
auc = round(metrics.roc_auc_score(y_test, y_pred_rf), 4)
plt.plot(fpr,tpr,label="Random Forest Classifier, AUC="+str(auc))

fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_gb)
auc = round(metrics.roc_auc_score(y_test, y_pred_gb), 4)
plt.plot(fpr,tpr,label="Gaussian Naive Bayes Classifier, AUC="+str(auc))

plt.legend()

Ensemble Techniques¶

In [ ]:
y_pred_XGB=train_model("Baseline XGBboost",XGBClassifier(),"Joined Feature Aggregated in input features",False,False)
In [ ]:
expLog

We also tried Gradient Boosting, it was taking more time than XGBoost and the accuracy was still less, so decided to procees with one ensemble technique

In [ ]:
#train_model("Baseline GradientBoosting",GradientBoostingClassifier(),"Joined Feature Aggregated in input features",False)
In [ ]:
#expLog

HYPERPARAMTER TUNING¶

Logistic Regression¶

Hyperparameters to be tuned:

  - tune_model - The machine learning model to be tuned - Logistic Regression
  - params: 
	c - the regularization parameter is set to [0.01, 10.0, 100.0]
	penalty - for regularization penalty, we are using both l1 and l2 regularization
  - cv - Number of cv folds is 3
  - scoring - the evaluation metric is accuracy
  - n_jobs - number of parallel jobs, we are using all available processors i.e. -1
  - verbose - we used the verbosity as 1 which will print information about the progress of the search

Summary - The process involves trying out all the combinations of the C and penalty hyperparameters specified in the params dictionary, and then evaluating the accuracy of the logistic regression model on the validation data for each combination. The optimal values of C and penalty that maximize the accuracy score, based on the evaluation metric specified by the scoring parameter, which is 'accuracy'

In [ ]:
tune_model=LogisticRegression()

params = {'C':[0.01,10.0, 100.0], 
          'penalty':['l1', 'l2']}

clf_gridsearch = GridSearchCV(tune_model,params,cv=3,scoring='accuracy',n_jobs=-1, verbose=1)
y_pred_lr1=train_model("Tuned Logisitc Regression",clf_gridsearch,"Joined Feature Aggregated+Hyerparameters",True,False)
In [ ]:
expLog

Gaussian Naive Bayes¶

Hyperparameters to be tuned:

  - gnb - The machine learning model to be tuned - Gaussian Naive Bayes
  - params: 
	var_smoothing -amount of smoothing to be applied is set to [1e-9, 1e-8, 1e-7, 1e-6,1e-10]
  - cv - Number of cv folds is 3
  - n_jobs - number of parallel jobs, we using all available processors i.e. -1
  - verbose - we used the verbosity as 1 which will print information about the progress of the search

Summary - The process involves performing a grid search to find the optimal value of the hyperparameter var_smoothing for the Gaussian Naive Bayes classifier.

There's no need to perform hyperparamter tuning for Gausssian Naive Bayes as it is not suited for this problem as it makes an assumption that data will be continious and normalised and clearly training accuracy is so low.

Decision Tree¶

Hyperparameters to be tuned:

   - dt - The machine learning model to be tuned - Decision Tree classifier
  - params: 
        - max_depth -the max depth of the decision tree is set to [2, 4]
        - min_samples_split - the minimum number of samples required to split an internal node is set to [2, 4]
        - min_samples_leaf - the minimum number of samples required to be at a leaf node is set to [2,4]
     - cv - Number of cv folds is 3
    - n_jobs - number of parallel jobs, we using all available processors i.e. -1
    - verbose - we used the verbosity as 1 which will print information about the progress of the search
  • Summary - Using the Decision Tree Classifier, we created a model called dt that uses decision trees to make predictions. It also sets up a grid of different values for some hyperparameters that the decision tree can use, such as max_depth, min_samples_split, and min_samples_leaf. We are using gridsearch to search for the best combination of these options, It will try different combinations of the hyperparameters from the grid and use cross-validation to evaluate how well they work. It will use all available processing cores to speed up the search and display information about the progress of the search.
In [ ]:
# Create a decision tree classifier
dt = DecisionTreeClassifier()

# Define the hyperparameter grid to search over
params = {
    'max_depth': [2, 4],
    'min_samples_split': [2, 4],
    'min_samples_leaf': [2, 4]}

clf_gridsearch = GridSearchCV(dt, params, cv=3,n_jobs=-1, verbose=1)
y_pred_dt1=train_model("Tuned Decision Tree",clf_gridsearch,"Joined Feature Aggregated+Hyerparameters",True,False)
In [ ]:
expLog

Random Forest¶

Hyperparameters to be tuned:

    - rf - The machine learning model to be tuned - Random Forest Classifier
    - params: 
        - n_estimators -the number of trees in the forest is set to [100, 200]
        - max_depth - the maximum depth of the tree is set to  [10, 20, 30]
        - min_samples_split - the minimum number of samples required to split an internal node is set to [2, 10]
        - min_samples_leaf - the minimum number of samples required to be at a leaf node is set to  [1, 2, 4]
   - cv - Number of cv folds is 3
   - n_jobs - number of parallel jobs, we using all available processors i.e. -1
    - verbose - we used the verbosity as 1 which will print information about the progress of the search

Summary - Using the Random Forest Classifier, It sets up a grid of values for various hyperparameters such as n_estimators, max_depth, min_samples_split, and min_samples_leaf. We are using gridsearch to search for the best combination of hyperparameters by trying different combinations of the hyperparameters from the grid and using cross-validation to evaluate their performance. The search is accelerated using all available processing cores and information about the progress of the search is displayed.

In [ ]:
rf = RandomForestClassifier()

# Define the hyperparameter grid to search over
params = {
    'n_estimators': [10,50],
    'min_samples_leaf': [2, 5, 10],
    'max_depth': [5, 10]  
}
clf_gridsearch = GridSearchCV(rf, params, cv=3,n_jobs=-1, verbose=1)
y_pred_rf1=train_model("Tuned Random Forest Trees",clf_gridsearch,"Joined Feature Aggregated+Hyerparameters",True,False)
In [ ]:
expLog

Comparing all the models visually¶

In [ ]:
from sklearn import metrics
plt.figure(figsize=(10, 5))
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_lr1)
auc = round(metrics.roc_auc_score(y_test, y_pred_lr), 4)
plt.plot(fpr,tpr,label="Tuned Logistic Regression, AUC="+str(auc))

fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_dt1)
auc = round(metrics.roc_auc_score(y_test, y_pred_dt), 4)
plt.plot(fpr,tpr,label="Tuned Decision Tree Classifier, AUC="+str(auc))


fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_rf1)
auc = round(metrics.roc_auc_score(y_test, y_pred_rf), 4)
plt.plot(fpr,tpr,label="Tuned Random Forest Classifier, AUC="+str(auc))

fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_XGB)
auc = round(metrics.roc_auc_score(y_test, y_pred_XGB), 4)
plt.plot(fpr,tpr,label="XGBoost Classifier, AUC="+str(auc))

plt.legend()

MLP¶

MLP Baseline Model¶

In [ ]:
from sklearn.neural_network import MLPClassifier
In [ ]:
y_pred_mlp=train_model("Baseline MLP",MLPClassifier(),"Joined Feature Aggregated",False,False)
In [ ]:
expLog

MLP Hyperparameter Tuning¶

In [ ]:
mlp = MLPClassifier()
param_grid = {
    'hidden_layer_sizes': [(120, 80, 40), (150, 100, 50), (128, 64, 32)],
    'max_iter': [5, 10, 50, 100, 150, 200],
    'activation': ['tanh', 'relu', 'sigmoid'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.1, 0.01, 0.001, 0.0001, 0.5, 0.05],
    'learning_rate': ['constant','adaptive'],
}
clf_gridsearch = GridSearchCV(mlp, params, cv=3, n_jobs=-1, verbose=1)
train_model("Tuned MLP Classifier",clf_gridsearch,"Joined Feature Aggregated+Hyperparameter Tuning")
In [ ]:
expLog

MLP using PyTorch¶

In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [ ]:
import numpy as np
import pandas as pd 
from sklearn.preprocessing import LabelEncoder
import os
import zipfile
from sklearn.base import BaseEstimator, TransformerMixin
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import roc_auc_score
import warnings
warnings.filterwarnings('ignore')


import torch
import torch.utils.data
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset


import time
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter



torch.manual_seed(0)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
from torch.utils.data.sampler import WeightedRandomSampler
In [ ]:
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter("runs/exp4")
In [ ]:
!pip install tensorboard pyngrok
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: tensorboard in /usr/local/lib/python3.9/dist-packages (2.12.2)
Collecting pyngrok
  Downloading pyngrok-6.0.0.tar.gz (681 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 681.2/681.2 kB 22.5 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Requirement already satisfied: wheel>=0.26 in /usr/local/lib/python3.9/dist-packages (from tensorboard) (0.40.0)
Requirement already satisfied: werkzeug>=1.0.1 in /usr/local/lib/python3.9/dist-packages (from tensorboard) (2.2.3)
Requirement already satisfied: tensorboard-data-server<0.8.0,>=0.7.0 in /usr/local/lib/python3.9/dist-packages (from tensorboard) (0.7.0)
Requirement already satisfied: markdown>=2.6.8 in /usr/local/lib/python3.9/dist-packages (from tensorboard) (3.4.3)
Requirement already satisfied: google-auth-oauthlib<1.1,>=0.5 in /usr/local/lib/python3.9/dist-packages (from tensorboard) (1.0.0)
Requirement already satisfied: requests<3,>=2.21.0 in /usr/local/lib/python3.9/dist-packages (from tensorboard) (2.27.1)
Requirement already satisfied: protobuf>=3.19.6 in /usr/local/lib/python3.9/dist-packages (from tensorboard) (3.20.3)
Requirement already satisfied: tensorboard-plugin-wit>=1.6.0 in /usr/local/lib/python3.9/dist-packages (from tensorboard) (1.8.1)
Requirement already satisfied: setuptools>=41.0.0 in /usr/local/lib/python3.9/dist-packages (from tensorboard) (67.7.2)
Requirement already satisfied: grpcio>=1.48.2 in /usr/local/lib/python3.9/dist-packages (from tensorboard) (1.53.0)
Requirement already satisfied: numpy>=1.12.0 in /usr/local/lib/python3.9/dist-packages (from tensorboard) (1.22.4)
Requirement already satisfied: absl-py>=0.4 in /usr/local/lib/python3.9/dist-packages (from tensorboard) (1.4.0)
Requirement already satisfied: google-auth<3,>=1.6.3 in /usr/local/lib/python3.9/dist-packages (from tensorboard) (2.17.3)
Requirement already satisfied: PyYAML in /usr/local/lib/python3.9/dist-packages (from pyngrok) (6.0)
Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.9/dist-packages (from google-auth<3,>=1.6.3->tensorboard) (0.2.8)
Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.9/dist-packages (from google-auth<3,>=1.6.3->tensorboard) (1.16.0)
Requirement already satisfied: cachetools<6.0,>=2.0.0 in /usr/local/lib/python3.9/dist-packages (from google-auth<3,>=1.6.3->tensorboard) (5.3.0)
Requirement already satisfied: rsa<5,>=3.1.4 in /usr/local/lib/python3.9/dist-packages (from google-auth<3,>=1.6.3->tensorboard) (4.9)
Requirement already satisfied: requests-oauthlib>=0.7.0 in /usr/local/lib/python3.9/dist-packages (from google-auth-oauthlib<1.1,>=0.5->tensorboard) (1.3.1)
Requirement already satisfied: importlib-metadata>=4.4 in /usr/local/lib/python3.9/dist-packages (from markdown>=2.6.8->tensorboard) (6.4.1)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.9/dist-packages (from requests<3,>=2.21.0->tensorboard) (2.0.12)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.9/dist-packages (from requests<3,>=2.21.0->tensorboard) (3.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.9/dist-packages (from requests<3,>=2.21.0->tensorboard) (1.26.15)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.9/dist-packages (from requests<3,>=2.21.0->tensorboard) (2022.12.7)
Requirement already satisfied: MarkupSafe>=2.1.1 in /usr/local/lib/python3.9/dist-packages (from werkzeug>=1.0.1->tensorboard) (2.1.2)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.9/dist-packages (from importlib-metadata>=4.4->markdown>=2.6.8->tensorboard) (3.15.0)
Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in /usr/local/lib/python3.9/dist-packages (from pyasn1-modules>=0.2.1->google-auth<3,>=1.6.3->tensorboard) (0.4.8)
Requirement already satisfied: oauthlib>=3.0.0 in /usr/local/lib/python3.9/dist-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<1.1,>=0.5->tensorboard) (3.2.2)
Building wheels for collected packages: pyngrok
  Building wheel for pyngrok (setup.py) ... done
  Created wheel for pyngrok: filename=pyngrok-6.0.0-py3-none-any.whl size=19879 sha256=8a66edf919f4e061923607a94d1dcb32baa7728b03b6e5cf948129d2773bba4a
  Stored in directory: /root/.cache/pip/wheels/31/49/9c/44b13823eb256a3b4dff34b972f7a3c7d9910bfef269e59bd7
Successfully built pyngrok
Installing collected packages: pyngrok
Successfully installed pyngrok-6.0.0
In [ ]:
%load_ext tensorboard
In [ ]:
#Read the saved filtered datasets
X_train= pd.read_csv('X_train.csv')
X_valid = pd.read_csv('X_valid.csv')
X_test = pd.read_csv('X_test.csv')
X_kaggle_test = pd.read_csv('X_kaggle_test.csv')


y_train=pd.read_csv('y_train.csv')
y_valid=pd.read_csv('y_valid.csv')
y_test=pd.read_csv('y_test.csv')
In [ ]:
#Remove Index Column 
X_train= X_train.iloc[:,1:]
X_valid =X_valid.iloc[:,1:]
X_test = X_test.iloc[:,1:]
X_kaggle_test = X_kaggle_test.iloc[:,1:]


y_train=y_train['TARGET']
y_valid=y_valid['TARGET']
y_test=y_test['TARGET']
In [ ]:
#WHILE EVAULATING FOR X_KAGGLE_TEST, WE ENCOUNTERED SHAPE DIFFERENCES BECAUSE OF FOLLOWING COLUMNS'S VALUE AS THEY DIDN'T EXIST IN KAGGLE TEST.
#WE DECIDED TO REMOVE THOSE ROWS WHICH DOESN'T CONTAIN THE FOLLOWING CONDITION (THEY ARE ONLY 3-4 ROWS THAT GOT REMOVED)
# Define the conditions for filtering out rows
cond1 = (X_train['CODE_GENDER'] != 'XNA') & (X_train['NAME_INCOME_TYPE'] != 'Maternity leave' ) & (X_train['NAME_FAMILY_STATUS'] != 'Unknown' ) 
cond2 = (X_valid['CODE_GENDER'] != 'XNA') & (X_valid['NAME_INCOME_TYPE'] != 'Maternity leave') & (X_valid['NAME_FAMILY_STATUS'] != 'Unknown' ) 
cond3 = (X_test['CODE_GENDER'] != 'XNA') & (X_test['NAME_INCOME_TYPE'] != 'Maternity leave') & (X_test['NAME_FAMILY_STATUS'] != 'Unknown' ) 

# Filter out rows that meet the conditions
X_train = X_train.loc[cond1]
X_valid = X_valid.loc[cond2]
X_test = X_test.loc[cond3]

# Remove the corresponding rows from the target variable
y_train = y_train.loc[cond1]
y_valid = y_valid.loc[cond2]
y_test = y_test.loc[cond3]
In [ ]:
#CONVERT THEM TO NUMPY ARRAY FOR TRAINING
y_train = y_train.to_numpy()
y_validation = y_valid.to_numpy()
y_test = y_test.to_numpy()
In [ ]:
#Tranform the features again for Pytorch model
# Create a class to select numerical or categorical columns 
# since Scikit-Learn doesn't handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

# Identify the numeric features we wish to consider.

num_attribs1= X_train.select_dtypes(include=['int64', 'float64']).columns


num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs1)),
        ('imputer', SimpleImputer(strategy='mean')),
        ('std_scaler', StandardScaler()),
    ])

# Identify the categorical features we wish to consider.

cat_attribs1 = X_train.select_dtypes(include=['object']).columns

# Notice handle_unknown="ignore" in OHE which ignore values from the validation/test that
# do NOT occur in the training set
cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs1)),
        #('imputer', SimpleImputer(strategy='most_frequent')),
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
    ])

data_prep_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])

X_train = data_prep_pipeline.fit_transform(X_train)
X_validation = data_prep_pipeline.transform(X_valid) 
X_test = data_prep_pipeline.transform(X_test)
In [ ]:
# # define undersample strategy
# random_undersampler = RandomUnderSampler(sampling_strategy='majority')
# # fit and apply the transform
# X_train, y_train = random_undersampler.fit_resample(X_train, y_train)
In [ ]:
# # Apply SMOTE
# sm = SMOTE(random_state=42)
# X_train, y_train = sm.fit_resample(X_train,y_train)

Multi-Layer Perceptron (MLP)¶

  • Multi-Layer Perceptron (MLP) is being utilized to perform binary classification analysis. It is composed of an input layer, one or more hidden layers, and an output layer. The data is fed through the input layer, and the hidden layers perform nonlinear transformations to extract essential features, while the output layer generates the final output of the model.

  • The data to be processed is received by the input layer, and then it is processed through a sequence of nonlinear transformations in the hidden layers in order to extract important features. Each hidden layer is composed of a set of neurons, where every neuron accepts inputs from the prior layer and generates an output based on an activation function. The output layer generates the ultimate output of the model.

  • Here, we are creating a linear layer that takes the number of neurons in the previous layer as input and number of neurons in the current layer as output. This linear layer applies a linear transformation to the input and produces an output tensor by multiplying the input tensor with a weight matrix and adding a bias vector.


  • ReLU stands for Rectified Linear Unit, and it is an activation function used in deep learning models. It applies a non-linear function to the output of each neuron in a neural network. The ReLU function returns the input value if it is positive, and returns zero if the input value is negative. In other words, ReLU introduces non-linearity to the network, helping it to learn more complex patterns in the data. Here, we are adding a ReLU activation function after each linear layer.
$$ max(0.0, x) $$

relu

  • Batch normalization (BatchNorm1d) is a technique to normalize the inputs of each layer, which is usually done between linear and non-linear transformations. It rescales the activations and stabilizes the learning process by reducing the internal covariate shift. In this implementation, we are applying BatchNorm1d to the output of each linear layer to standardize the inputs to the activation function of the next layer, which helps improve the training speed and stability of the model.

  • Dropout is a regularization technique that is used to prevent overfitting in neural networks. During training, some random neurons are turned off or “dropped out” with a specified probability. This helps in preventing the network from relying too much on a few neurons and promotes the learning of more robust features. We set the dropout rate to 0.5, meaning that 50% of the neurons are dropped out at each iteration.

  • Sigmoid activation function is used in neural networks to introduce non-linearity in the output of a neuron. It maps any input value to a value between 0 and 1, which can be interpreted as a probability. It is commonly used in binary classification problems, where the output is interpreted as the probability of belonging to the positive class.


$$ 1.0 / (1.0 + e^-x) $$

sigmoid

  • Binary cross entropy is a commonly used loss function in Multilayer Perceptron (MLP) models for binary classification tasks. It measures the difference between predicted and true binary labels and provides a measure of how well the model is performing in terms of accuracy.
$$ BinaryCrossEntropy = H_p(q) = -\frac{1}{N}\sum_{i=1}^n y_i * log(p(y_i))+ (1-y_i) * log(1-p(y_i)) + \frac{\lambda}{2}||{\theta}||^2 $$
In [ ]:
# Define the model architecture
class BinaryClassifier(nn.Module):
    def __init__(self, input_size):
        super(BinaryClassifier, self).__init__()
        self.fc1 = nn.Linear(input_size, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 16)
        self.fc4 = nn.Linear(16, 1)
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()
        self.dropout = nn.Dropout(p=0.2)

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.relu(self.fc3(x))
        x = self.dropout(x)
        x = self.sigmoid(self.fc4(x))
        return x

# Define hyperparameters
batch_size = 128
learning_rate = 0.001

# Convert data to PyTorch tensors and create DataLoader
X_train_tensor = torch.from_numpy(X_train).float()
X_valid_tensor = torch.from_numpy(X_validation).float()
X_test_tensor = torch.from_numpy(X_test).float()
y_train_tensor = torch.from_numpy(y_train).float()
y_valid_tensor = torch.from_numpy(y_validation).float()
y_test_tensor = torch.from_numpy(y_test).float()

train_ds = TensorDataset(X_train_tensor, y_train_tensor)
valid_ds = TensorDataset(X_valid_tensor, y_valid_tensor)
test_ds = TensorDataset(X_test_tensor, y_test_tensor)



train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(valid_ds, batch_size=batch_size)
test_loader = DataLoader(test_ds, batch_size=batch_size)

# Initialize the model and optimizer
model = BinaryClassifier(X_train.shape[1])
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
In [ ]:
model
Out[ ]:
BinaryClassifier(
  (fc1): Linear(in_features=199, out_features=64, bias=True)
  (fc2): Linear(in_features=64, out_features=32, bias=True)
  (fc3): Linear(in_features=32, out_features=16, bias=True)
  (fc4): Linear(in_features=16, out_features=1, bias=True)
  (relu): ReLU()
  (sigmoid): Sigmoid()
  (dropout): Dropout(p=0.2, inplace=False)
)
In [ ]:
time1=time.time()
epochs = 20
# Train the model
try:
      for epoch in range(epochs):
          train_loss = 0
          train_acc = 0
          train_auc=0
          valid_loss = 0
          valid_acc = 0
          valid_auc=0

          pred=[]
          epoch_label=[]

          # Training loop
          model.train()
          count=0
          for X, y in train_loader:
              optimizer.zero_grad()
              y_pred = model(X.float())
              
              loss = criterion(y_pred.squeeze(), y)

              loss.backward()
              optimizer.step()
              y_pred=torch.round(y_pred)
              
              train_loss += loss.item()
              train_acc += (y_pred.float()  == y).float().mean().item()
              train_auc += roc_auc_score(y.tolist(), y_pred.tolist())
            
              count += 1

          train_loss = np.round(train_loss/count, 3)
          train_acc=np.round(train_acc/count,3)
          train_auc=np.round(train_auc/count,3)
          writer.add_scalar('Training Auc',train_auc,epoch)
          writer.add_scalar('Training Loss',train_loss,epoch)
          #Validation loop
          model.eval()
          count=0
          with torch.no_grad():
              for X, y in valid_loader:
                  y_pred = model(X)
                  
                  loss = criterion(y_pred.squeeze(), y)
                  y_pred=torch.round(y_pred)
                  valid_loss += loss.item()
                  valid_acc += (y_pred.float()== y).float().mean().item()
                  valid_auc += roc_auc_score(y.tolist(), y_pred.tolist())
                  count += 1

          valid_loss = np.round(valid_loss/count, 3)
          valid_acc=np.round(valid_acc/count,3)
          valid_auc=np.round(valid_auc/count,3)

          writer.add_scalar('Validation Auc',valid_auc,epoch)
        
          
          # Print the results for this epoch
          print(f'Epoch {epoch + 1}, Train Loss: {train_loss}, '
                f'Train AUC: {train_auc}, '
                f'Valid Loss: {valid_loss}, '
                f'Valid AUC: {valid_auc}')
      total_time=time.time()-time1  
      #print((epoch,loss, accuracy, auc))
      # Set the model to evaluation mode
      model.eval()


      y_pred = []
      y_true = []

      # Iterate over the test set and predict the probabilities and true labels
      with torch.no_grad():
          for inputs, labels in test_loader:
              outputs = model(inputs)
              y_pred.extend(outputs.squeeze())
              y_true.extend(labels.numpy())

      # Calculate the ROC AUC score
      test_auc = roc_auc_score(y_true, y_pred)
except:
  pass
Epoch 1, Train Loss: 0.273, Train AUC: 0.5, Valid Loss: 0.258, Valid AUC: 0.5
Epoch 2, Train Loss: 0.261, Train AUC: 0.504, Valid Loss: 0.256, Valid AUC: 0.505
Epoch 3, Train Loss: 0.257, Train AUC: 0.516, Valid Loss: 0.256, Valid AUC: 0.515
Epoch 4, Train Loss: 0.253, Train AUC: 0.529, Valid Loss: 0.25, Valid AUC: 0.529
Epoch 5, Train Loss: 0.249, Train AUC: 0.542, Valid Loss: 0.25, Valid AUC: 0.531
Epoch 6, Train Loss: 0.246, Train AUC: 0.545, Valid Loss: 0.248, Valid AUC: 0.536
Epoch 7, Train Loss: 0.244, Train AUC: 0.549, Valid Loss: 0.248, Valid AUC: 0.541
Epoch 8, Train Loss: 0.242, Train AUC: 0.555, Valid Loss: 0.251, Valid AUC: 0.545
Epoch 9, Train Loss: 0.241, Train AUC: 0.558, Valid Loss: 0.251, Valid AUC: 0.548
Epoch 10, Train Loss: 0.239, Train AUC: 0.562, Valid Loss: 0.25, Valid AUC: 0.547
Epoch 11, Train Loss: 0.237, Train AUC: 0.565, Valid Loss: 0.254, Valid AUC: 0.542
Epoch 12, Train Loss: 0.236, Train AUC: 0.567, Valid Loss: 0.253, Valid AUC: 0.553
Epoch 13, Train Loss: 0.235, Train AUC: 0.571, Valid Loss: 0.254, Valid AUC: 0.545
Epoch 14, Train Loss: 0.233, Train AUC: 0.573, Valid Loss: 0.254, Valid AUC: 0.548
Epoch 15, Train Loss: 0.232, Train AUC: 0.577, Valid Loss: 0.256, Valid AUC: 0.547
Epoch 16, Train Loss: 0.231, Train AUC: 0.577, Valid Loss: 0.259, Valid AUC: 0.551
Epoch 17, Train Loss: 0.229, Train AUC: 0.584, Valid Loss: 0.26, Valid AUC: 0.557
Epoch 18, Train Loss: 0.229, Train AUC: 0.585, Valid Loss: 0.261, Valid AUC: 0.558
Epoch 19, Train Loss: 0.227, Train AUC: 0.588, Valid Loss: 0.266, Valid AUC: 0.559
Epoch 20, Train Loss: 0.226, Train AUC: 0.592, Valid Loss: 0.261, Valid AUC: 0.552
In [ ]:
try: hcdrLog 
except : hcdrLog = pd.DataFrame(
    columns=[
        "Architecture string", 
        "Optimizer", 
        "Epochs", 
        "Train AUC",
        "Valid Auc",
        "Test AUC",
        "Total experiment time"
    ]
)
In [ ]:
hcdrLog
Out[ ]:
Architecture string Optimizer Epochs Train AUC Valid Auc Test AUC Total experiment time
In [ ]:
hcdrLog.loc[len(hcdrLog)] = [
    '199-relu-64-relu-32-relu-16-dropout-sigmoid-1', 
    "adam", 
    f"{epochs}", 
    f"{train_auc*100}%",
    f"{valid_auc*100}%",
    f"{test_auc*100}%",
    f"{total_time:8.2f} secs",

]

hcdrLog
Out[ ]:
Architecture string Optimizer Epochs Train AUC Valid Auc Test AUC Total experiment time
0 199-relu-32-relu-16-relu-8-sigmoid-1 adam 50 76.2% 67.10000000000001% 73.14872195693694% 487.11 secs
1 199-relu-64-relu-32-relu-16-sigmoid-1 adam 50 62.8% 55.400000000000006% 71.84095564070823% 433.81 secs
2 199-relu-64-relu-32-relu-16-dropout-sigmoid-1 adam 10 56.3% 55.00000000000001% 76.73255313226835% 203.19 secs
In [ ]:
%tensorboard --logdir=runs/exp4 --host=127.0.0.3

tensorboard

Submission File Prep¶

For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. The file should contain a header and have the following format:

SK_ID_CURR,TARGET
100001,0.1
100005,0.9
100013,0.2
etc.
In [ ]:
X_kaggle_test= data_prep_pipeline.fit_transform(X_kaggle_test)

# Set the model to evaluation mode
model.eval()

# Convert the Kaggle test data to a tensor and move it to the device
X_kaggle_test_tensor = torch.from_numpy(X_kaggle_test).float()

# Generate predictions for the Kaggle test set
with torch.no_grad():
    kaggle_preds = model(X_kaggle_test_tensor)
    
# Extract the predicted probabilities for the positive class
kaggle_preds = kaggle_preds.numpy()
In [ ]:
dataset=pd.read_csv("/content/drive/MyDrive/AML/application_test.csv")
In [ ]:
# Submission dataframe
submit_df = dataset[['SK_ID_CURR']]
submit_df['TARGET'] = kaggle_preds

submit_df.head()
Out[ ]:
SK_ID_CURR TARGET
0 100001 0.044793
1 100005 0.122536
2 100013 0.072631
3 100028 0.062897
4 100038 0.984631
In [ ]:
submit_df.to_csv("MLP_submission5.csv",index=False)

Kaggle submission via the command line API¶

In [ ]:
! kaggle competitions submit -c home-credit-default-risk -f MLP_submission5.csv -m "MLP_submission5"
100% 876k/876k [00:03<00:00, 254kB/s]
Successfully submitted to Home Credit Default Risk

report submission¶

Click on this link

Home Credit Default Risk (HCDR)

Final Project for Spring 2023 course: CSCI-P 556 - Applied Machine Learning

Write-up

Abstract¶

The objective of the Home Credit Default Risk project is to assess a client's ability to repay a loan based on a collection of features present in the dataset.

In phase 1, we identified the problem as a classification task and selected suitable ML models. Phase 2 involved EDA and creating baseline models with four different algorithms. In Phase 3, we conducted additional feature engineering to identify the most effective features and used GridSearch to further refine the accuracy of our models. We observed that the XGBoost produced a Kaggle score of 0.736.

During this phase, our aim was to implement Multilayer Perceptron architectures to analyze our data and evaluate the performance based on the cross-entropy loss function. We utilized the ReLU activation function with the Adam optimizer and performed 50 epochs using 3 hidden layers and 1 output features. Additionally, we utilized TensorBoard to visualize our accuracy and loss. As a result, we achieved a test ROC of 0.76, with a Kaggle submission public score of 0.728.

Project Description (data and tasks)¶

Data description:

  1. application_{train|test}.csv
  • The primary table consists of the necessary information regarding home credit loan applications. This is the main table, broken into two files for Train (with TARGET) and Test (without TARGET). Static data for all applications. One row represents one loan in our data sample.
  1. bureau.csv
  • This is a record of how well the client has paid back loans and borrowed money in the past, as reported by other banks or lenders. All client's previous credits provided by other financial institutions were reported to the Credit Bureau (for clients who have a loan in our sample). For every loan in our sample, there are as many rows as the number of credits the client had in the Credit Bureau before the application date.
  1. bureau_balance.csv
  • This has information about the amounts you owed and paid back each month in the past, which is recorded by a company that keeps track of people's credit histories. This table has one row for each month of history of every previous credit reported to Credit Bureau – i.e the table has (#loans in sample * # of relative previous credits * # of months where we have some history observable for the previous credits) rows.
  1. POS_CASH_balance.csv
  • This table shows how much money the client owes to Home Credit every month from their previous purchases made using credit or cash loans. This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample – i.e. the table has (#loans in sample * # of relative previous credits * # of months in which we have some history observable for the previous credits) rows.
  1. credit_card_balance.csv
  • This table shows how much money the applicant owed on their Home Credit credit cards in past months. This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample – i.e. the table has (#loans in sample * # of relative previous credit cards * # of months where we have some history observable for the previous credit card) rows.
  1. previous_application.csv
  • All previous applications for Home Credit loans of clients who have loans in our sample. There is one row for each previous application related to loans in our data sample.
  1. installments_payments.csv
  • This shows whether or not you've paid back loans you got from Home Credit in the past. There is a) one row for every payment that was made plus b) one row each for missed payment. One row is equivalent to one payment of one installment OR one installment corresponding to one payment of one previous Home Credit credit related to loans in our sample.

dataset


Task to be tackled:
We handled the following task in this phase:

  1. Firstly, we implemented a MultiLayer Perceptron (MLP) model, which is a neural network architecture consisting of an input layer, one or more hidden layers, and an output layer.
  2. We then performed experiments using various optimizer functions and activation functions to optimize the performance of our MLP model.
  3. Furthermore, we experimented with different numbers of hidden layers and neurons in each layer to find the most appropriate architecture for our dataset.
  4. After that, we evaluated the accuracy of each model and identified the highest performing one. By performing these tasks, we were able to build a robust MLP model that accurately predicted the binary classification of our dataset.


Workflow Diagram: Gantt_Chart_1

Gantt_Chart_1

Gantt_Chart_3

Team and Plan updates¶

Team name: FP_Group22

Team members:

  1. Aarushi Dua (aarudua@iu.edu)
  2. Sai Teja Burla (saburla@iu.edu)
  3. Lakshay Madaan (lmadaan@iu.edu)
  4. Shyam Makwana (smakwana@iu.edu)

Phase leadership plan:

Phase

Phase Leader

Tasks

Phase 1

Shyam Makwana

Project proposal, Layout, Notebook, Credit assignment plan, Gantt chart, Dataset

Phase 2

Aarushi Dua

Grab data, EDA, Metrics, Baseline models, Baseline pipeline, Brief report, Presentation and script

Phase 3

Sai Teja Burla

Feature engineering, Hyperparameter tuning, Additional feature selection, ensemble methods, Presentation and script

Phase 4

Lakshay Madaan

Neural Network, Advanced model architectures, Loss functions, Final project presentation and report



Credit assignment plan:

Phase

Member Name

Task Name

Description














Phase 1

Aarushi Dua

Sai Teja Burla

Shyam Makwana

Lakshay Madaan

Problem Understanding

Comprehending the problem and understanding the provided data 

Shyam Makwana



Assigning Tasks

Assigning tasks to each of the team members 

Phase Planning

Planning all the phases

Data Description

Describing what information each of the given csv files have with a sample of their content 

Sai Teja Burla

Abstract

Brief description of the problem statement at hand

Researched ML models and metric

Description of ML Models and Evaluation Metrics 

Block Diagram

Visual representation of the pipeline

Aarushi Dua

Model Pipeline

Explained the steps to perform during the course of the project

Gantt Chart

Summarize the timeline of the project using gantt charts

Aarushi Dua

Sai Teja Burla

Credit Assignment Plan

Design the four phases and describe the task required.

Lakshay Madaan

Environmental Setup

Connected Data with docker

Data Preparation

Checked Dataset are loading and performing basic data visualization.







Phase 2

Aarushi Dua

Sai Teja Burla

Data Visualization and EDA

Visualize the dataset to gather insights and perform different data pre-processing techniques.

Lakshay Madaan

Shyam Makwana

Feature Engineering and Pipeline Creation

Data Transformation is required and will add/remove necessary features. 

Along with creating a pipeline for numeric and categorical features.

Aarushi Dua

Model Training

Fit the training data on various ML algorithms using pipelines

Sai Teja Burla

Model Evaluation

Apply different evaluation metrics to get the accuracy and loss

Shyam Makwana

Research about different Hyperparameters

Gather hyperparameters of each model applied into the GridsearchCV

Lakshay Madaan

Comparing Baseline Models

Compare the results of each baseline model.





Phase 3

Sai Teja Burla

Feature Importance

Figure out the importance of each features used

Aarushi Dua

Tuning Hyperparameters

Hyperparameter tuning is performed again to improve the performance.

Shyam Makhwana

Visualizing different algorithm’s results

Plot bars and charts to display the results of each model

Lakshay Madaan

Feature Engineering

Any further manipulation in features if required




Phase 4

Lakshay Madaan

Final Model Evaluation

Perform final model evaluation on the best model achieved.

Shyam Makhwana

Report Writing and Video Editing

Write report for final submission and record the video presentation

Sai Teja Burla

Aarushi Dua

Adding Multilayer Perceptron

Introduce advance ML model; i.e Pytorch model for further improvement in model’s accuracy

Neural Network¶

  • The below architecture is used primarily to experiment with different batch sizes, epochs, and the number of nodes in hidden layers.

neural_network

neural_network_1

Data Leakage¶

  • It is important to make sure that the data is free from any mistakes or assumptions that can impact the EDA, our interpretation of the model, and the actual model that we create. Leakage happens when data is present during training but not during testing and can be a concern when dealing with missing values. But we were able to handle the missing values during feature engineering.
  • There are many cardinal sins of machine learning, such as poor data quality, improper data splitting leading to overfitting, relying on only one scoring metric, overinterpreting results, and confusion due to excessive cross-validation. To ensure that these concerns were addressed, we made sure that the data was well structured and missing values were handled properly; overfitting was avoided by properly splitting the data; multiple scoring metrics were used for evaluation; we tabulated our results to avoid over-interpretation; and multiple cross-validations were used to avoid confusion.
  • In addition to the above points, we ensured that proper documentation and version control was done to avoid any mistakes in reproducing the results, the right model and hyperparameters were chosen, and the model was validated on diverse datasets to further improve the robustness of the model.

Gap Analysis¶

When we looked into the leaderboard we observed that the other groups have implemented LGBM which might have played a key role in their public kaggle score being higher than ours. We also saw that they made batch normalization layers in their MLP model which could have also helped with the same. Added to this their initial number of features are more than ours. There could also be a difference in our preprocessing methodologies.

Modeling Pipelines¶

A visualization of the modeling pipelines and subpipelines: Block Diagram


Machine Learning Pipelines:
The model that we are aiming to build should have the capability to classify the target variable into two classes, i.e. 1 or 0 which are the numeric representation for whether the person who took out a loan will repay it or not based on all the data provided to us in the dataset. For achieving this with accurate results we are planning on trying the following algorithms to see which one provides us with the best outcome:

  1. Logistic Regression - This is a form of statistical model that is used to examine the correlation between a binary or categorical dependent variable and one or more independent variables. It employs a logistic function to simulate the likelihood that the dependent variable will fall into a certain category. The logistic function converts any input to a number between 0 and 1, which represents the likelihood of the binary result.
$$ \text { Log Loss } = -\frac{1}{N}\sum_{i=1}^n y_i * log(p(y_i))+ (1-y_i) * log(1-p(y_i)) + \frac{\lambda}{2}||{\theta}||^2 $$
  1. Gaussian Naive Bayes - It is a variant of the Naive Bayes algorithm for classification tasks in machine learning. It is predicated on the Bayes theorem and the supposition of data feature independence. The input characteristics in Gaussian Naive Bayes are thought to have a Gaussian or normal distribution. With the input data, the algorithm determines the likelihood of each class, and then it chooses the class with the highest probability as the anticipated output.

  2. Decision Tree Classification - In this model, based on the characteristics of the input data, a tree-like model of decisions and their potential outcomes is built. The method begins with the complete dataset and chooses the optimal feature to divide the data according to a certain criterion, such Gini Impurity or Information Gain. The data is then divided into subsets depending on the selected feature and the procedure is then performed iteratively for each subset until all the data in each subset belong to the same class, or a stopping requirement is satisfied.

  3. Random Forest Classification - This model builds numerous decision trees using subsets of the input data and randomly choosing features for each split. Each decision tree only sees a subset of the data since they are trained on a bootstrapped sample of the data. The decision trees' majority vote determines the algorithm's final result.

  4. Neural Networks - Neural networks are a class of models that learn by adjusting the weights of the input features to generate a desired output. The input features are fed into a network of interconnected neurons, which transform the input through a series of mathematical operations. These transformations are governed by a set of weights that are learned during training. The output is then compared to the actual target output and the weights are updated using a loss function, such as cross-entropy loss.

$$ BinaryCrossEntropy = H_p(q) = -\frac{1}{N}\sum_{i=1}^n y_i * log(p(y_i))+ (1-y_i) * log(1-p(y_i)) + \frac{\lambda}{2}||{\theta}||^2 $$


Description of metrics and analysis:
The evaluation metrics that we want to use to check whether our model is performing well or not are as follows:

  1. Confusion Matrix - A confusion matrix is a table that is frequently used to assess the effectiveness of a classification model. It is a matrix that compares the predicted labels of the model with the actual labels of the data.
  • The confusion matrix consists of the following cells:

True Positive (TP)

The number of rows that are correctly predicted as positive by the model.

True Negative (TN) 

The number of rows that are correctly predicted as negative by the model.

False Positive (FP)

The number of rows that are incorrectly predicted as positive by the model.

False Negative (FN)

The number of rows that are incorrectly predicted as negative by the model.

  1. Classification Report - An evaluation of a classification model's performance is summarized in a classification report, which is commonly produced using metrics derived from a confusion matrix. A classification report generally includes the following metrics:
    • Precision - The ratio of true positives among all the instances that the model classified as positive.

$$ Precision = \frac{TP}{TP + FP} $$
- **Recall** - The ratio of true positives among all the instances that actually belong to the positive class of the model.

$$ Recall = \frac{TP}{TP + FN} $$
- **F1-Score** - The harmonic mean of precision and recall.

$$ F1 = \frac{2 * Precision * Recall}{Precision + Recall} = \frac{2*TP}{2*TP+FP+FN} $$
  1. Accuracy Score - A typical performance metric in machine learning for assessing the accuracy of a classification model is the accuracy score. It is defined as the ratio of correctly classified instances out of the total number of instances in the dataset.

$$ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} $$
  1. ROC Curve - An ROC curve is a graphical representation of the performance of a binary classification model at different classification thresholds. In machine learning, it is frequently used to assess and contrast the effectiveness of various categorization models. The True Positive Rate and False Positive Rate are plotted on the y-axis and x-axis respectively.

Experimental results of phase 3¶

index exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train time Test time Best Params Description
0 Baseline Logisitc Regression-82 input features 91.32% 91.35% 91.28% 76.09% 76.11% 76.16% 10.06 secs 0.50 secs {'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'auto', 'n_jobs': None, 'penalty': 'l2', 'random_state': None, 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0, 'warm_start': False} Joined Feature Aggregated in input features
1 Baseline Decision Tree-82 input features 100.00% 85.41% 85.33% 100.00% 56.18% 56.81% 34.50 secs 0.46 secs {'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'random_state': 42, 'splitter': 'best'} Joined Feature Aggregated in input features
2 Baseline Random Forest-82 input features 100.00% 91.91% 91.94% 100.00% 74.77% 74.38% 157.52 secs 3.05 secs {'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False} Joined Feature Aggregated in input features
3 Baseline GaussianNB-82 input features 28.80% 28.65% 28.57% 64.62% 64.94% 64.18% 2.54 secs 0.53 secs {'priors': None, 'var_smoothing': 1e-09} Joined Feature Aggregated in input features
4 Baseline XGBboost-82 input features 92.77% 91.93% 91.95% 89.09% 77.70% 77.70% 199.03 secs 0.58 secs {'objective': 'binary:logistic', 'use_label_encoder': None, 'base_score': None, 'booster': None, 'callbacks': None, 'colsample_bylevel': None, 'colsample_bynode': None, 'colsample_bytree': None, 'early_stopping_rounds': None, 'enable_categorical': False, 'eval_metric': None, 'feature_types': None, 'gamma': None, 'gpu_id': None, 'grow_policy': None, 'importance_type': None, 'interaction_constraints': None, 'learning_rate': None, 'max_bin': None, 'max_cat_threshold': None, 'max_cat_to_onehot': None, 'max_delta_step': None, 'max_depth': None, 'max_leaves': None, 'min_child_weight': None, 'missing': nan, 'monotone_constraints': None, 'n_estimators': 100, 'n_jobs': None, 'num_parallel_tree': None, 'predictor': None, 'random_state': None, 'reg_alpha': None, 'reg_lambda': None, 'sampling_method': None, 'scale_pos_weight': None, 'subsample': None, 'tree_method': None, 'validate_parameters': None, 'verbosity': None} Joined Feature Aggregated in input features
5 Tuned Logisitc Regression-82 input features 91.32% 91.34% 91.29% 76.03% 76.09% 76.14% 56.75 secs 0.44 secs {'C': 0.01, 'penalty': 'l2'} Joined Feature Aggregated+Hyerparameters
6 Tuned Decision Tree-82 input features 91.33% 91.33% 91.34% 67.35% 67.07% 66.84% 81.10 secs 0.61 secs {'max_depth': 2, 'min_samples_leaf': 2, 'min_samples_split': 2} Joined Feature Aggregated+Hyerparameters
7 Tuned Random Forest Trees-82 input features 91.80% 91.76% 91.76% 81.64% 74.13% 73.91% 282.92 secs 0.68 secs {'max_depth': 10, 'min_samples_leaf': 2, 'n_estimators': 10} Joined Feature Aggregated+Hyerparameters

Discussion¶

While building the MLP, we had the following discussion about its architecture before deciding on best model.

  1. As a preprocessing technique we implemented both Random UnderSampling as well as Smote for over sampling to see which could help make a better model but noticed that both of these were not helping with making a better model than the ones we already have so we decided to go with a simple pipeline that we implemented before for our regular models.
  2. We used BCELoss as the loss function since we are dealing with binary target variables.
  3. For the optimizers we made use of SGD and Adam but as a part of our initial experiment phase we saw that SGD was not helping at all so we ended up showcasing only Adam as our main optimizer.
  4. We also experimented with learning rate values 0.01 and 0.001 and noticed that 0.001 always gives us better results.
  5. We used different batch sizes to experiment and noticed that lesser batch size is giving us better results.
  6. Coming to the architecture, we experimented with 2 or 3 hidden layers in each of our experiments.
  7. Weight decay was also experimented with but the accuracy kept remaining the same throughout the epochs due to which we decided not to use it.

Conclusion¶

  • In conclusion, our project focused on predicting the risk of loan defaults for Home Credit using machine learning techniques. This is an important problem as loan defaults can have a significant impact on financial institutions and their clients.

  • In phase 2, we conducted EDA to identify significant features and created ML pipelines using various classifiers. Underfitting was observed in Gaussian Naive Bayes and overfitting in Decision Tree and Random Forest. In the next phase, advanced feature engineering and hyperparameter tuning will be implemented to enhance model evaluation metrics.

  • In phase 3, we explored, cleaned, and engineered features for both categorical and numerical data, trained multiple models, and optimized their hyperparameters. Our best model achieved a ROC AUC score of 0.78 on the test data, outperforming the baseline. Important features for predicting loan defaults were past payment history, income, credit bureau reports, and external sources. In the future, we plan to implement a new multitask loss function in PyTorch to predict both the class and length of time before defaulting. Our findings and model could have practical implications for mitigating the risk of loan defaults in financial institutions.

  • In this phase, we built a neural network with 199 input features, 3 hidden layers (64, 32, 16 neurons), and 1 output features, using the ReLU activation function on the hidden layers and the Sigmoid activation function on the output layer and Adam Optimizer. Cross Entropy Loss Function was used for evaluation. The MLP model had a test roc score of 0.76, and a Kaggle submission public score of 0.728. However, the AUC score of the non-deep learning models (Logistic Regression, Random Forest, and XGBoost) was better. The MLP model can be improved using embeddings or advanced classification models like boosting.

Kaggle Submission¶

Kaggle submission

Kaggle submission 1

References¶

Some of the material in this notebook has been adopted from here

  1. https://www.kaggle.com/c/home-credit-default-risk
  2. https://matplotlib.org/
  3. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
  4. https://machinelearningmastery.com/choose-an-activation-function-for-deep-learning/

TODO: Predicting Loan Repayment with Automated Feature Engineering in Featuretools¶

Read the following:

  • feature engineering via Featuretools library:
    • https://github.com/Featuretools/predict-loan-repayment/blob/master/Automated%20Loan%20Repayment.ipynb
  • https://www.analyticsvidhya.com/blog/2018/08/guide-automated-feature-engineering-featuretools-python/
  • feature engineering paper: https://dai.lids.mit.edu/wp-content/uploads/2017/10/DSAA_DSM_2015.pdf
  • https://www.analyticsvidhya.com/blog/2017/08/catboost-automated-categorical-data/